require(Hmisc)
getRs('reptools.r', put='source')
getRs('movStats.r', put='source')
R Workflow
This article outlines an analysis project workflow that I’ve found to be efficient in making reproducible research reports using R with Rmarkdown and now Quarto. I start by covering importing data, creating annotated analysis files, examining the extent and patterns of missing data, and running descriptive statistics on them, with the goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, show metadata/data dictionaries, and produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed and some flexible bivariate and 3-variable analysis methods are presented, with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented. In the process several useful report writing methods are exemplified, including program-controlled creation of multiple report tabs.
This report makes heavy use of the following R packages and Github repositories:
- Hmisc package, which contains functions for importing data, data annotation, summary statistics, statistical graphics, advanced table making, etc.
- data.table package for data storage, retrieval, manipulation, munging, aggregation, merging, and reshaping
- haven package for importing datasets from statistical packages
- ggplot2 package for static graphics
- plotly package for interactive graphics
- consort package for consort diagrams showing observation filtering
- rms package for statistical modeling, validation, and presentation
- knitr package for running reproducible reports, and also providing the kable and kables functions for simple html table printing
- rscripts Github repository with utility functions that are all loaded when reptools.r is loaded:
  - addCap, printCap for adding captions to a list of figures and for printing the list
  - dataChk for data checking
  - dataOverview for a dataset overview
  - hashCheck for checking if parent objects have changed so a slow analysis has to be re-run (i.e., taking control of caching)
  - htmlList to easily print vectors in a named list using kable
  - htmlView, htmlViewx for viewing data dictionaries/metadata in browser windows
  - kabl to make it easy to use kable and kables for making html tables
  - maketabs to automatically make multiple tabs in Quarto reports, each tab holding the output of one or more R commands
  - makecolmarg to print an object in the right margin in Quarto reports
  - makecnote to print an object in a collapsible Quarto note
  - makecallout, a generic Quarto callout maker called by makecolmarg, makecnote, and makecodechunk
  - makemermaid to make Quarto mermaid diagrams with insertion of variable values
  - rsHelp for viewing help files for functions in rscripts
  - scplot for putting graphs in separate chunks with captions in the TOC
  - seqFreq for creating a factor variable with categories in descending order of sequential frequencies of conditions (as used in computing study exclusion counts)
  - vClus for variable clustering
  - runifChanged, which uses hashCheck to automatically re-run an analysis if needed, otherwise to retrieve previous results efficiently
- qreport Github repository, which has functions dedicated to randomized clinical trial reporting:
  - aePlot for making an interactive plotly dot chart of adverse event proportions by treatment
Another file in rscripts is movStats.r which defines the movStats function for computing summary statistics by moving overlapping windows of a continuous variable, or simply stratified by a categorical variable. The rscripts Github functions are accessed by the Hmisc function getRs, e.g.
All the available help files for functions in rscripts are at hbiostat.org/R/rscripts. To view a help file for one of the functions in the RStudio Viewer pane use for example rsHelp(movStats) or rsHelp(reptools).
Assignment Operator
You assign an R object to a value using the assignment operator <- or the equal sign. <- is read as “gets”.
x <- y
d <- read.csv('mydata.csv')
x = y
Object Types
Everything in R is an object. Some primitive types of objects in R are below.
| Type | Meaning |
|---|---|
| integer | whole numbers |
| logical | values of TRUE or FALSE |
| double | floating point non-whole numbers |
| character | character strings |
| function | code defining a function |
In the table below, objects of different shapes are described. rows and cols refer to vectors of integers or logicals, or, if the elements of the object are named, character strings.
| Type | Example | Values Retrieved By |
|---|---|---|
| scalar | x <- 3 | x |
| vector | y <- c(1, 2, 5) | y[2] (2), y[2:3] (2, 5), y[-1] (2, 5), y[c(TRUE,FALSE,TRUE)] (1, 5) |
| named vector | y <- c(a=1, b=2, d=5) | y[2] (2), y['b'] (2), y[c('a','b')] (1, 2) |
| matrix | y <- cbind(1:3, 4:6) | y[rows,cols], y[rows,] (all cols), y[,cols] (all rows) |
| list | x <- list(a='cat', b=c(1,3,7)) | x$a ('cat'), x[[1]] ('cat'), x[['a']] ('cat') |
Named vectors provide an extremely quick table lookup and recoding capability.
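For instance, subscripting a named vector by a character vector of keys recodes values in one step (a minimal sketch with made-up animal categories):

```r
# Named vector as a lookup table: names are the keys, values the results
grp     <- c(cat='domestic', dog='domestic', fox='wild')
animals <- c('dog', 'fox', 'dog', 'cat')
unname(grp[animals])   # "domestic" "wild" "domestic" "domestic"
```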
list objects are arbitrary trees and can have elements nested to any level. You can have lists of lists or lists of data frames/tables.
Vectors can be of many different types when a class is added to them. Two of the most common are Dates and factors. Character strings are handled very efficiently in R so there is not always a need to store categorical variables as factors. But there is one reason: to order levels, i.e., distinct variable values, so that tabular and graphical output will list values in a more logical order than alphabetic. A factor variable has a levels attribute added to it to accomplish this. An example is x <- factor(x, 1:3, c('cat', 'dog', 'fox')) where the second argument 1:3 is the vector of possible numeric values x currently takes on (in order) and the three character strings are the corresponding levels. Internally factors are coded as integers, but they print as character strings.
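To illustrate the level-ordering benefit with a small made-up severity variable: the levels attribute controls the order in which tables and graphs list the values.

```r
x <- c(3, 1, 2, 3, 1)
f <- factor(x, 1:3, c('mild', 'moderate', 'severe'))
table(f)   # counts listed in the specified level order, not alphabetically
```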
Rectangular data objects, i.e., when the number of rows is the same for every column (variable), can be represented by matrices, data.frames, and data.tables. In a matrix, every value is of the same type. A data.frame or a data.table is an R list that can have mixtures of numeric, character, factor, dates, and other object types. A data.table is also a data.frame but the converse isn’t true. data.tables are handled by the R data.table package and don’t have row names but can be indexed, are much faster to process, and have a host of methods implemented for aggregation and other operations. data.frames are handled by base R.
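As a minimal sketch of the difference, using a toy data frame: data.table::setDT converts a data frame to a data table in place, after which both classes apply and data.table syntax is available.

```r
require(data.table)
d <- data.frame(id=1:4, x=c(5, 7, 6, 8))
setDT(d)            # convert in place to a data.table (no copy made)
class(d)            # "data.table" "data.frame" -- the converse isn't true
d[x > 5, mean(x)]   # data.table syntax: filter rows, then compute (7)
```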
Data frames are best managed by converting them to data tables and using the data.table package. When data.table is not used there are three indispensable functions for operating on data frames:
- with for analyzing variables within a data frame without constantly prefixing variable names with dataframename$
- transform for adding or changing variables within a data frame
- the Hmisc upData function for doing the same as transform but also allowing metadata to be added to the data, e.g., variable labels and units (to be discussed later)
Here are some examples of with and transform.
# Better than mean(mydata$systolic.bp - mydata$diastolic.bp) :
with(mydata, mean(systolic.bp - diastolic.bp))
# Better than mydata$pulse.pressure <- mydata$systolic.bp - mydata$diastolic.bp:
mydata <- transform(mydata,
pulse.pressure = systolic.bp - diastolic.bp,
bmi = wt / ht ^ 2)
# Perform several operations on the same data frame
with(mydata, {
x3 <- x1 / sqrt(x2)
ols(y ~ x3)
} )
Subscripting
Examples of subscripting are given above. Subscripting via placement of [] after an object name is used for subsetting, and occasionally for using some elements more than once:
x <- c('cat', 'dog', 'fox')
x[2:3]
[1] "dog" "fox"
x[c(1, 1, 3, 3, 2)]
[1] "cat" "cat" "fox" "fox" "dog"
Subscripting a variable or a data frame/table by a vector of TRUE/FALSE values is a very powerful feature of R. This is used to obtain elements satisfying one or more conditions:
x <- c(1, 2, 3, 2, 1, 4, 7)
y <- c(1, 8, 2, 3, 8, 9, 2)
x[y > 7]
[1] 2 1 4
The last line of code can be read as “values of x such that y > 7”.
Branching and If/Then
Decisions Based on One Scalar Value
Common approaches to this problem are if and switch.
type <- 'semiparametric'
f <- switch(type,
            parametric     = ols(y ~ x),
semiparametric = orm(y ~ x),
nonparametric = rcorr(x, y, type='spearman'),
{ z <- y / x
c(median=median(z), gmean=exp(mean(log(z)))) } )
# The last 2 lines are executed for any type other than the 3 listed
f <- if(type == 'parametric') ols(y ~ x)
else
if(type == 'semiparametric') orm(y ~ x)
else
if(type == 'nonparametric') rcorr(x, y, type='spearman')
else {
  z <- y / x
  c(median=median(z), gmean=exp(mean(log(z))))
}
What is inside if( ) must be a single scalar element that evaluates to TRUE or FALSE.
Series of Separate Decisions Over a Vector of Values
The ifelse or data.table::fifelse functions are most often used for this, but data.table::fcase is a little better. Here’s an example.
x <- c('cat', 'dog', 'giraffe', 'elephant')
type <- ifelse(x %in% c('cat', 'dog'), 'domestic', 'wild')
type
[1] "domestic" "domestic" "wild" "wild"
require(data.table)
fcase(x %in% c('cat', 'dog'), 'domestic', default='wild')
[1] "domestic" "domestic" "wild" "wild"
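fcase takes condition/value pairs evaluated in order, with default catching everything else. A hypothetical age-grouping sketch:

```r
require(data.table)
age <- c(3, 14, 35, 70)
# For each element, the first TRUE condition supplies the value
fcase(age < 13, 'child',
      age < 20, 'teen',
      age < 65, 'adult',
      default = 'senior')   # "child" "teen" "adult" "senior"
```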
if Trick
Sometimes when constructing variable-length vectors and other objects, elements are to be included in the newly constructed object only when certain conditions apply. When a condition does not apply, no element is to be inserted. We can capitalize on the fact that the result of if(...) is NULL when ... is not TRUE, and concatenating NULL results in ignoring it. Here are two examples. In the first the resulting vector will have length 2, 3, or 4 depending on sex and height. In the second example the new vector will have the appropriate element names preserved.
y <- 23; z <- 46; sex <- 'female'; height <- 71; u <- pi; w <- 7
c(y, z, if(sex == 'male') u, if(height > 70) w)
[1] 23 46  7
c(x1=3, if(sex == 'male') c(x2=4), if(height > 70) c(x3=height))
x1 x3
 3 71
# reduce clutter in case of variable name conflicts:
rm(y, z, sex, height, u, w)
Functions
There are so many functions in R that it may be better to use the stackoverflow.com Q&A to find the ones you need (as of 2022-05-26 there are 450,000 R questions there). Here are just a few of the multitude of handy R functions. The first functions listed below return the R missing value NA if any element is missing. You can specify na.rm=TRUE to remove NAs from consideration first, so they will not cause the result to be NA. Most functions get their arguments (inputs) in () after the function name. Some functions like %in% are binary operators whose two arguments are given on the left and right of %in%.
- mean, median, quantile, var, sd: compute statistical summaries on one vector
- min, max: minimum or maximum of values in a vector or of multiple variables, resulting in one number
- pmin, pmax: parallel minimum and maximum for vectors, resulting in a vector. Example: pmin(x, 3) returns a vector of the same length as x. Each element is the minimum of the original value or 3.
- unique: return vector of distinct values, in same order as original values
- union, intersect, setdiff, setequal: set operations on two vectors (see below)
- a %in% b, a %nin% b: set membership functions that determine whether each element in a is in b (for %in%) or is not in b (for %nin%, which is in the Hmisc package)
Set operators are amazingly helpful. Here are some examples.
unique(x) # vector of distinct values of x, including NA if occurred
sort(unique(x)) # distinct values in ascending order
setdiff(unique(x), NA) # distinct values excluding NA if it occurred
duplicated(x) # returns TRUE for elements that are duplicated by
# values occurring EARLIER in the list
union(x, y) # find all distinct values in the union of x & y
intersect(x, y) # find all distinct values in both x & y
setdiff(x, y) # find all distinct x that are not in y
setequal(x, y) # returns TRUE or FALSE depending on whether the distinct
# values of x and y are identical, ignoring how they
# are ordered
# Find list of subject ids that are found in baseline but not
# in follow-up datasets
idn <- setdiff(baseline$id, followup$id)
# Don't do this:
# if(animal == 'cat' | animal == 'dog') ....
# But use %in% instead:
# if(animal %in% c('cat', 'dog')) ...
# Likewise don't say if(animal != 'cat' & animal != 'dog') but
# use if(animal %nin% c('cat', 'dog')) ...
Even new R users can benefit from writing functions to reduce repetitive coding. A function has arguments and these can have default values for when the argument is not specified by the user when the function is called. Here are some examples. One-line functions do not need to have their bodies enclosed in {}.
cuberoot <- function(x) x ^ (1/3)
cuberoot(8)
[1] 2
g <- function(x, power=2) {
u <- abs(x - 0.5)
u / (1. + u ^ power)
}
g(3, power=2)
[1] 0.3448276
g(3)
[1] 0.3448276
# Function to make mean() drop missing values without our telling it
mn <- function(x) mean(x, na.rm=TRUE)
# Function to be used throughout the report to round fractional values
# by a default amount (here round to 0.001)
rnd <- function(x) round(x, 3)
# edit the 3 to change rounding anywhere in the report
# The following simple function saves coding when you need to recode multiple
# variables from 0/1 to no/yes.
yn <- function(x) factor(x, 0:1, c('no', 'yes'))
1 Report Formatting
A state-of-the-art way to make reproducible reports is to use a statistical computing language such as R and its knitr package in conjunction with either RMarkdown or Quarto, with the latter likely to replace the former. Both of the report-making systems allow one to produce reports in a variety of formats including html, pdf, and Word. Html is recommended because pages can be automatically resized to allow optimum viewing on devices of most sizes, and because html allows for interactive graphics and other interactive components. Pdf is produced by converting RMarkdown or Quarto-produced markdown elements to \(\LaTeX\).
This document can serve as a template for using R with Quarto; one can see the raw script by clicking on Code at the top right of the report. When one has only one output format target, things are fairly straightforward except some situations where mixed formats are rendered in the same code chunk. Click below for details.
To make use of specialized functions that produce html or \(\LaTeX\) markup, one often has to put results='asis' in the code chunk header to keep the system from disturbing the generated html or \(\LaTeX\) markup so that it will be typeset correctly in the final document. This process works smoothly but creates one complication: if you print an object that produces plain text in the same code chunk, the system will try to typeset it as html or \(\LaTeX\). To prevent this from happening you either need to split the chunk into multiple chunks (some with results='asis' and some not) or you need to make it clear that parts of the output are to be typeset verbatim. To do that, a simple function pr can sense whether results='asis' is in effect for the current chunk. If so, the object is surrounded by the markdown verbatim indicator: three consecutive back ticks. If not, the object is left alone. pr is defined in the markupSpecs$markdown$pr object, so you can bring it into your session by copying it into a local function pr as shown below, in a chunk that has results='asis' to show that verbatim output appears anyway. If the argument obj to pr is a data frame or data table, variables will be rounded to the value given in the argument dec (default dec=3) before printing. If you specify inline=x the object x is printed with cat() instead of print(). inline is mainly for printing character strings.
An example of something that may not render correctly due to results='asis' being in the chunk header (needed for html(...)):
options(prType='html')
f <- ols(y ~ rcs(x1, 5))
f # prints model summary in html format
m <- matrix((1:10)/3, ncol=2)
m
# use pr(obj=m) to fix
Here are examples of pr usage.
require(Hmisc)
pr <- markupSpecs$markdown$pr
x <- (1:5)/7
pr('x:', x)
x:
[1] 0.1428571 0.2857143 0.4285714 0.5714286 0.7142857
pr(obj=x)
[1] 0.1428571 0.2857143 0.4285714 0.5714286 0.7142857
pr(inline=paste(round(x,3), collapse=', '))
0.143, 0.286, 0.429, 0.571, 0.714
Instead of working to keep certain outputs verbatim you can use knitr::kable() to convert verbatim output to markdown.
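For example, wrapping the matrix from the earlier chunk in knitr::kable() yields markdown that renders cleanly even under results='asis':

```r
m <- matrix((1:10)/3, ncol=2)
# kable converts the matrix to a markdown table rather than verbatim text
knitr::kable(round(m, 3))
```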
1.1 Quarto Syntax for Figures
One can specify sizes, layouts, captions, and more using Quarto markup. Captions are ignored unless a figure is given a label. Figure labels must begin with fig-. The figure can be cross-referenced elsewhere in the document using for example See @fig-scatterplot. Here is example syntax.
```{r}
#| label: fig-myplot
#| fig-cap: An example caption (use one long line for caption)
#| fig-height: 3
#| fig-width: 4
plot(1:7, abs(-3 : 3))
```
Other examples are in the next section.
The reptools repository has helper functions for building a table of figures. To use those, put addCap() or addCap(scap="short caption for figure") as the first line of code in the chunk. If you don’t specify scap the short caption will be taken as the full caption you specified with the fig-cap: markup. At the end of the report you can print the table of figures using the following syntax (but surround the last line with back ticks).
# Figures
r printCap()
1.2 Quarto Built-in Syntax for Enhancing R Output
Helper functions described below allow one to enhance graphical and tabular R output by taking advantage of Quarto formatting features. These functions allow one to produce different formats within one code chunk, e.g., a plot in the margin and a table in a collapsible note appearing after the code chunk. But if you need only one output format within a chunk you can make use of built-in syntax as described here. The yaml-like syntax also allows you to specify heights and widths for figures, plus multi-figure layouts.
Here is some example code with all the markup shown.
```{r}
#| column: margin
#| fig-height: 1
#| fig-width: 3
par(mar=c(2, 2, 0, 0), mgp=c(2, .5, 0))
set.seed(1)
x <- rnorm(1000)
hist(x, nclass=40, main='')
x[1:3] # ordinary output stays put
knitr::kable(x[1:3]) # html output put in margin
hist(x, main='')
```
The results follow.
par(mar=c(2, 2, 0, 0), mgp=c(2, .5, 0))
set.seed(1)
x <- rnorm(1000)
hist(x, nclass=40, main='')
x[1:3] # ordinary output stays put
[1] -0.6264538  0.1836433 -0.8356286
knitr::kable(x[1:3]) # html output put in margin
| x |
|---|
| -0.6264538 |
| 0.1836433 |
| -0.8356286 |
hist(x, main='')
Here are a few markups for figure layout inside R chunks.
Wide page (takes over the margins) and put multiple plots in 1 row:
#| column: screen-inset
#| layout-nrow: 1
When plotting 3 figures put the first 2 in one row and the third in the second row and make it wide.
#| layout: [[1,1], [1]]
Make the top left panel be wider than the top right one.
#| layout: [[70,30], [100]]
Top left and top right panels have equal widths but devote 0.1 of the total width to an empty region between the two top panels.
#| layout: [[45, -10, 45], [100]]
See here for details about figure specifications inside code chunks.
1.3 Quarto Report Writing Helper Functions
Helper functions are defined when you run the Hmisc function getRs to retrieve them from Github, i.e., getRs('reptools.r'). You can get help on these functions by running rsHelp(functionname). Several of the functions construct Quarto callouts which are fenced-off sections of markup that trigger special formatting, especially when producing html. The special formatting includes collapsible sections and marginal notes. Here is a summary of some of the reptools helper functions.
| Function | Purpose |
|---|---|
| dataChk | run a series of logical expressions for checking data consistency, put results in separate tabs using maketabs, and optionally create two summary tabs |
| dataOverview | runs a data overview report |
| missChk | creates a series of analyses of the extent and patterns of missing values in a data table or data frame, and puts graphical summaries in tabs |
| htmlList | print a named list using the names as headers |
| kabl | front-end to knitr::kable and kables; if you run kabl on more than one object it will automatically call kables |
| makecallout | generic Quarto callout maker used by makecnote, makecolmarg |
| makecnote | print objects or run code and place output in an initially collapsed callout note |
| makecolmarg | print objects or run code and place output in a marginal note |
| maketabs | print objects or run code placing output in separate tabs |
| makemermaid | makes a mermaid diagram with R variable values included in the diagram |
| varType | classify variables in a data table/frame or a vector as continuous, discrete, or non-numeric non-discrete |
| conVars | use varType to extract list of continuous variables |
| disVars | use varType to extract list of discrete variables |
| vClus | run Hmisc::varclus on a dataset after reducing it |
The input to maketabs, as will be demonstrated later, may be a named list, or more commonly, a series of formulas whose right-hand sides are executed and the result of each formula is placed in a separate tab. The left side of the formula becomes the tab label. For makecolmarg there should be no left side of the formula as marginal notes are not labeled. For the named list option the list names become the tab names. Examples of both approaches appear later in this report. In formulas, a left side label must be enclosed in back ticks and not quotes if it is a multi-word string. A wide argument is used to expand the width of the output outside the usual margins. An initblank argument creates a first tab that is empty. This allows one to show nothing until one of the other tabs is clicked. Alternately you can specify as the first formula ` ` ~ ` `.
The two approaches to using maketabs also apply to makecnote and makecolmarg. Examples of the “print an object and place it inside a callout” are given later in the report for makecnote and makecolmarg. Here is an example of the more general formula method that can render any object, including html widgets as produced by plotly graphics. An interactive plotly graphic appears at the bottom of the plots in the right margin. You can single click on elements in the legend to turn them off and on, and double click within the legend to restore to default values.
require(Hmisc)
options(plotlyauto=TRUE) # makes Hmisc use plotly's auto size option
# rather than computing height, width
getRs('reptools.r', put='source')
set.seed(1)
x <- round(rnorm(100, 100, 15))
makecolmarg(~ table(x) + raw + hist(x) + plot(ecdf(x)) + histboxp(x=x))
x
67 70 73 77 78 79 81 82 83 84 86 87 88 89 90 91 92 93 94 95
1 1 1 1 1 1 2 1 1 1 1 1 1 3 1 6 1 3 3 2
96 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 116 117
1 6 5 3 2 1 2 2 4 5 2 2 6 2 3 3 1 2 1 3
118 120 121 122 123 124 130 133 136
2 1 1 1 1 2 1 1 1
# or try makecnote(`makecnote example` ~ kabl(table(x)) + hist(x) + ...
# Avoid raw by using kabl(table(x)) instead of table(x)
Adding + raw to a formula in makecnote, makecolmarg, or maketabs forces printed results to be treated as raw verbatim R output.
makecallout is a general Quarto callout maker that implements different combinations of the following: list or formula, print or run code, defer executing and only produce the code to execute vs. running the code now, and close the callout or leave it open for more calls.
reptools also has helper functions for interactively accessing information to help in report and analysis building:
| Function | Purpose |
|---|---|
| htmlView | view html-converted objects in RStudio View pane |
| htmlViewx | view html-converted objects in external browser |
1.4 Multi-Output Format Reports
To allow one report to be used to render multiple output formats, especially html and pdf, it is helpful to be able to sense which output format is currently in play, and to use different functions or options to render output explicitly for the current format. Here is how to create variables that can be referenced simply in code throughout the report, and to invoke the plotly graphics package if output is in html to allow interactivity. A small function ggp is defined so that if you run any ggplot2 output through it, the result will be automatically converted to plotly using the ggplotly function, otherwise it is left at standard static ggplot2 output if html is not the output target.
outfmt <- if(knitr::is_html_output ()) 'html' else 'pdf'
markup <- if(knitr::is_latex_output()) 'latex' else 'html'
ishtml <- outfmt == 'html'
if(ishtml) require(plotly)
ggp <- if(ishtml) ggplotlyr else function(ggobject, ...) ggobject
# See below for more about ggplotlyr (a front end for ggplotly that can
# correct a formatting issue with hover text)
The Hmisc, rms, and rmsb packages have a good deal of support for creating \(\LaTeX\) output in addition to html. They require some special \(\LaTeX\) packages to be accessed. In addition, if using any of Quarto’s nice features for making marginal notes, there is another \(\LaTeX\) package to attach. Below you’ll find what needs to be added to the yaml prologue at the top of your script if using Quarto. You have to modify pdf-engine to suit your needs. I use luatex because it handles special unicode characters. In the future (approximately July 2022) a bug in Pandoc will be fixed and you can put links-as-notes: true in the yaml header instead of redefining href and linking in hyperref.
format:
html:
self-contained: true
. . .
pdf:
pdf-engine: lualatex
toc: false
number-sections: true
number-depth: 2
top-level-division: section
reference-location: document
listings: false
header-includes:
\usepackage{marginnote, here, relsize, needspace, setspace, hyperref}
\renewcommand{\href}[2]{#2\footnote{\url{#1}}}
The href redefinition above turns URLs into footnotes if running \(\LaTeX\).
There is one output element provided by Quarto that will not render correctly to \(\LaTeX\): a marginal note using the markup .column-margin. To automatically use an alternate in-body format, define a function that can be used for both typesetting formats.
mNote <- if(ishtml) '.column-margin'
else
  '.callout-note appearance="minimal"'
Then use r mNote enclosed in back ticks in place of the .column-margin callout for generality.
Even when producing only html, one may wish to save individual graphics for manuscript writing. For non-interactive graphics you can right click on the image and download the .png file. For interactive plots, plotly shows a “take a snapshot” icon when you hover over the image. Clicking this icon will produce a static .png snapshot of the graph. Some graphs are not appropriate for static documents, and the variables created in the code above can be checked so that, for example, an alternative graph can be produced when making a .pdf file. But in other cases one just produces an additional static plot that is not shown in the html report. See the margin note near Figure 58 for an example.
Hmisc Formatting for Variable Labels in Tables
As done with various Hmisc and rms package functions, one can capitalize on Hmisc’s special formatting of variable labels and units when constructing tables in \(\LaTeX\) or html. The basic constructs are shown in the code below.
# Retrieve a set of markup functions depending on typesetting format
# See below for definition of ishtml
specs <- markupSpecs[[if(ishtml) 'html' else 'latex']]
# Hmisc markupSpecs functions create plain text, html, latex,
# markdown, or plotmath code
varlabel <- specs$varlabel # retrieve an individual function
# Format text describing variable named x
# hfill=TRUE typesets units to be right-justified in label
# Use the following character string as a row label
# Default specifies the string to use if there is not label
# (usually taken as the variable name)
varlabel(label(x, default='x'), units(x), hfill=TRUE)
2 File Directory Structure
I have a directory for each major project, and put everything in that directory (including data files) except for graphics files for figures, which are placed in their own subdirectory underneath the project folder. The directory name for the project includes key identifying information, and files within that directory do not contain project information in their names, nor do they contain dates, unless I want to freeze an old version of an analysis script.
With Quarto I specify that html files are to be self-contained, so there are no separate graphics files. For multi-analyst projects, or ones in which you want to capture the entire code history, having the project on Github is worthwhile.
3 Analysis File Creation
I typically create a compact analysis file in a separate R script called create.r and have it produce compressed R binary data.frame or data.table .rds files using saveRDS(name, 'name.rds', compress='xz'). Then I have an analysis script named for example a.qmd (for Quarto reports) or a.Rmd (for RMarkdown reports) that starts with d <- readRDS('name.rds').
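A minimal round trip with a toy data frame (real projects split the two halves across create.r and the analysis script, with a fixed file name such as name.rds):

```r
d <- data.frame(id=1:3, sbp=c(120, 135, 128))
f <- tempfile(fileext='.rds')
saveRDS(d, f, compress='xz')   # the create.r side
d2 <- readRDS(f)               # the a.qmd / a.Rmd side
identical(d, d2)               # TRUE
```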
When variables need to be recoded, have labels added or changed, or have units of measurement added, I specify those using the Hmisc package upData function.
Labels and units appear in describe and contents function outputs below and in axis labels for graphics. To facilitate some operations requiring variable names to be quoted, define a function .q to quote them automatically. .q is like the Hmisc function Cs but also allows elements to be named. It will be in Hmisc 4.7-1.
.q <- function(...) {
s <- sys.call()[-1]
w <- as.character(s)
n <- names(s)
if(length(n)) names(w) <- n
w
}
.q(a, b, c, 'this and that')
[1] "a" "b" "c" "this and that"
.q(dog=a, giraffe=b, cat=c)
    dog giraffe     cat
    "a"     "b"     "c"
Here is an upData example:
# Function to recode from atypical coding for yes/no in raw data
yn <- function(x) factor(x, 0:1, c('yes', 'no'))
d <-
upData(d,
rename = .q(gender=sex, any.event=anyEvent),
posSE = yn(posSE),
newMI = yn(newMI),
newPTCA = yn(newPTCA),
newCABG = yn(newCABG),
death = yn(death),
hxofHT = yn(hxofHT),
hxofDM = yn(hxofDM),
hxofCig = factor(hxofCig, c(0, 0.5, 1),
c('heavy', 'moderate', 'non-smoker')),
hxofMI = yn(hxofMI),
hxofPTCA = yn(hxofPTCA),
hxofCABG = yn(hxofCABG),
chestpain= yn(chestpain),
anyEvent = yn(anyEvent),
drop=.q(event.no, phat, mics, deltaEF,
newpkmphr, gdpkmphr, gdmaxmphr, gddpeakdp, gdmaxdp,
hardness),
labels=c(
bhr = 'Basal heart rate',
basebp = 'Basal blood pressure',
basedp = 'Basal Double Product bhr*basebp',
age = 'Age',
pkhr = 'Peak heart rate',
sbp = 'Systolic blood pressure',
dp = 'Double product pkhr*sbp',
dose = 'Dose of dobutamine given',
maxhr = 'Maximum heart rate',
pctMphr = 'Percent maximum predicted heart rate achieved',
mbp = 'Maximum blood pressure',
dpmaxdo = 'Double product on max dobutamine dose',
dobdose = 'Dobutamine dose at max double product',
baseEF = 'Baseline cardiac ejection fraction',
dobEF = 'Ejection fraction on dobutamine',
chestpain = 'Chest pain',
ecg = 'Baseline electrocardiogram diagnosis',
restwma = 'Resting wall motion abnormality on echocardiogram',
posSE = 'Positive stress echocardiogram',
newMI = 'New myocardial infarction',
newPTCA = 'Recent angioplasty',
newCABG = 'Recent bypass surgery',
hxofHT = 'History of hypertension',
hxofDM = 'History of diabetes',
hxofMI = 'History of myocardial infarction',
hxofCig = 'History of smoking',
hxofPTCA = 'History of angioplasty',
hxofCABG = 'History of coronary artery bypass surgery',
anyEvent = 'Death, newMI, newPTCA, or newCABG'),
units=.q(age=years, bhr=bpm, basebp=mmHg, basedp='bpm*mmHg',
pkhr=mmHg, sbp=mmHg, dp='bpm*mmHg', maxhr=bpm,
mbp=mmHg, dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',
pctMphr='%', dose=mg, dobdose=mg)
)
saveRDS(d, 'stressEcho.rds', compress='xz')
# Note that we could have automatically recoded all 0:1 variables
# if they were all to be treated identically:
for(x in names(d))
  if(all(d[[x]] %in% c(0,1))) d[[x]] <- yn(d[[x]])

Sometimes metadata comes from a separate source. Suppose you imported a data frame d but have also imported a data frame m containing metadata: a variable name holding the same variable names as in d, plus fields label, units, and comment. Dataset m can contain variables not currently being used. To add the labels and units into d and to store comments separately, use the following example.
n <- names(d)
i <- n %nin% m$name
if(any(i)) cat('The following variables have no metadata:',
paste(n[i], collapse=', '), '\n')
vcomment <- m$comment
names(vcomment) <- m$name
mn <- subset(m, name %in% n)
labs <- mn$label
un <- mn$units
names(labs) <- names(un) <- mn$name
d <- upData(d, labels=labs, units=un)

To look up the comment for a variable at any time use e.g. vcomment['height']. All comments were saved in vector vcomment in case the metadata dictionary m defined variables that are not in d but were to be imported later in the script. Comments for multiple variables can be looked up using e.g. vcomment[.q(height, weight, age)].
If you want to look up a variable’s comment without having to quote its name use the following:
vcom <- function(...)
vcomment[as.character(sys.call()[-1])]
# Example usage: vcom(age,sbp,dbp)

The built-in function in R for reading .csv files is read.csv. The Hmisc package has a function csv.get which calls read.csv but offers some enhancements in variable naming, date handling, and reading variable labels from a specified row number. Illegal characters in variable names are changed to periods, and by default underscores are also changed to periods. If any variable names are changed and the labels argument is not given, the original variable names are stored in the variable label attributes.
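As an illustration, a minimal csv.get call might look like the following; the file name is hypothetical.

```r
require(Hmisc)
# Hypothetical file; lowernames=TRUE converts variable names to lower case
d <- csv.get('mydata.csv', lowernames=TRUE)
contents(d)   # renamed variables keep their original names as labels
```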
The fread function in the data.table package is blazing fast for reading large files and offers a number of options. csv.get uses it if data.table is installed on the system.
If reading data exported from REDCap that are placed into the project directory I run the following to get rid of duplicate (factor and non-factor versions of variables REDCap produces) variables and automatically convert dates to Date variables:
require(Hmisc)
getRs('importREDCap.r', put='source') # source() code to define function
mydata <- importREDCap() # by default operates on last downloaded export
saveRDS(mydata, 'mydata.rds', compress='xz')
When file names are not given to importREDCap the function looks for the latest created .csv file and .R file with same prefix and uses those. See this for more information.
SAS, Stata, and SPSS binary files are converted to R data.frames using the R haven package. Here’s an example:
require(haven) # after you've installed the package
d <- read_sas('mydata.sas7bdat')
d <- read_xpt('mydata.xpt') # SAS transport files
d <- read_dta('mydata.dta') # Stata
d <- read_sav('mydata.sav') # SPSS
d <- read_por('mydata.por') # Older SPSS files

These import functions carry variable labels into the data frame and convert dates and times appropriately. Character vectors are not converted to factors.
One of the most important principles to follow in programming data analyses is to avoid doing the same thing more than once. Repetitive code wastes time and is harder to maintain. One example of avoiding repetition is in reading a large number of files in R. If the files are stored in one directory and have a consistent file naming scheme (or you want to import every .csv file in the directory), one can avoid naming the individual files. The results may be stored in an R list that has named elements, and there are many processing tasks that can be automated by looping over this list.
In the following example assume that all the data files are .csv files in the current working directory, and they all have names of the form xz*.csv. Let’s read all of them and put each file into a data frame named by the characters in front of .csv. These data frames are stuffed into a list named X. The Hmisc csv.get function is used to read the files, automatically translating dates to Date variables, and because lowernames=TRUE is specified, variable names are translated to lower case. There is an option to fetch variable labels from a certain row of each .csv file but we are not using that.
files <- list.files(pattern='^xz.*\\.csv$') # vector of qualifying file names
# Get base part of the .csv file names
dnames <- sub('\\.csv$', '', files)
X <- list()
i <- 0
for(f in files) {
cat('Reading file', f, '\n')
i <- i + 1
d <- csv.get(f, lowernames=TRUE)
# To read SAS, Stata, SPSS binary files use a haven function instead
# To convert to data.table do setDT(d) here
X[[dnames[i]]] <- d
}
saveRDS(X, 'X.rds', compress='xz') # Efficiently store all datasets together

To process one of the datasets one can do things like summary(X[[3]]) or summary(X$baseline), where the third dataset stored was named baseline because it was imported from baseline.csv.
Now there are many possibilities for processing all the datasets at once such as the following.
k <- length(X) # number of datasets
nam <- lapply(X, names) # list with k elements, each is a vector of names
# Get the union of all variable names used in any dataset
sort(unique(unlist(nam))) # all the variables appearing in any dataset
# Get variable names contained in all datasets
common <- names(X[[1]])
for(i in 2 : k) {
common <- intersect(common, names(X[[i]]))
if(! length(common)) break # intersection already empty
}
sort(common)
# Compute number of variables across datasets
nvar <- sapply(X, length) # or ncol
# Print number of observations per dataset
sapply(X, nrow)
# For each variable name count the number of datasets containing it
w <- data.table(dsname=rep(names(X), nvar), vname=unlist(nam))
w[, .N, keyby=vname]
# For each variable create a comma-separated list of datasets
# containing it
w[, .(datasets=paste(sort(dsname), collapse=', ')), keyby=vname]
# For datasets having a subject ID variable named id compute
# the number of unique ids
uid <- function(d) if('id' %in% names(d)) length(unique(d$id)) else NA
sapply(X, uid)
# To repeatedly analyze one of the datasets, extract it to a single data frame
d <- X$baseline
describe(d)

The Hmisc package cleanup.import function improves imported data storage in a number of ways, including converting double precision variables to integer when they contain no fractional values (this halves the storage requirement). Hmisc::upData is the go-to function for annotating data frames/tables, renaming variables, and dropping variables. Hmisc::dataframeReduce removes problematic variables, e.g., those with a high fraction of missing values or that are binary with very small prevalence.
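A hedged sketch of these clean-up steps follows; the cutoff values are purely illustrative.

```r
require(Hmisc)
d <- cleanup.import(d)   # e.g., stores integer-valued doubles as integers
# Drop variables missing in > 40% of observations, having > 10 levels,
# or binary with < 2% prevalence (illustrative cutoffs)
d <- dataframeReduce(d, fracmiss=0.4, maxlevels=10, minprev=0.02)
```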
3.1 Variable Naming
I prefer short but descriptive variable names. As exemplified above, I use variable labels and units to provide more information. For example I wouldn’t name the age variable age.at.enrollment.yrs but would name it age with a label of Age at Enrollment and with units of years. Short, clear names unclutter code, especially formulas in statistical models. One can always fetch a variable label while writing a program (e.g., typing label(d$age) at the console) to check that you have the right variable (or put the data dictionary in a window for easy reference, as shown below). Hmisc package graphics and table making functions such as summaryM and summary.formula specially typeset units in a smaller font.
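For example, with Hmisc one can attach the label and units directly to a short-named variable:

```r
require(Hmisc)
d <- data.frame(age = c(45, 62, 58))
label(d$age) <- 'Age at Enrollment'
units(d$age) <- 'years'
label(d$age)   # "Age at Enrollment"; formulas still use the short name age
```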
3.2 Data Dictionary
The Hmisc package contents function will provide a concise data dictionary. Here is an example using the permanent version (which coded binary variables as 0/1 instead of N/Y) of the dataset created above, which can be accessed with the Hmisc getHdata function. The top of the contents output has the number of levels for factor variables hyperlinked. Click on the number to go directly to the list of levels for that variable.
require(Hmisc)
getHdata(stressEcho)
d <- stressEcho
html(contents(d), levelType='table')

Data frame: d
558 observations and 31 variables, maximum # NAs: 0

| Name | Labels | Units | Levels | Storage |
|---|---|---|---|---|
| bhr | Basal heart rate | bpm |  | integer |
| basebp | Basal blood pressure | mmHg |  | integer |
| basedp | Basal Double Product bhr*basebp | bpm*mmHg |  | integer |
| pkhr | Peak heart rate | mmHg |  | integer |
| sbp | Systolic blood pressure | mmHg |  | integer |
| dp | Double product pkhr*sbp | bpm*mmHg |  | integer |
| dose | Dose of dobutamine given | mg |  | integer |
| maxhr | Maximum heart rate | bpm |  | integer |
| pctMphr | Percent maximum predicted heart rate achieved | % |  | integer |
| mbp | Maximum blood pressure | mmHg |  | integer |
| dpmaxdo | Double product on max dobutamine dose | bpm*mmHg |  | integer |
| dobdose | Dobutamine dose at max double product | mg |  | integer |
| age | Age | years |  | integer |
| gender |  |  | 2 | integer |
| baseEF | Baseline cardiac ejection fraction | % |  | integer |
| dobEF | Ejection fraction on dobutamine | % |  | integer |
| chestpain | Chest pain |  |  | integer |
| restwma | Resting wall motion abnormality on echocardiogram |  |  | integer |
| posSE | Positive stress echocardiogram |  |  | integer |
| newMI | New myocardial infarction |  |  | integer |
| newPTCA | Recent angioplasty |  |  | integer |
| newCABG | Recent bypass surgery |  |  | integer |
| death |  |  |  | integer |
| hxofHT | History of hypertension |  |  | integer |
| hxofDM | History of diabetes |  |  | integer |
| hxofCig | History of smoking |  | 3 | integer |
| hxofMI | History of myocardial infarction |  |  | integer |
| hxofPTCA | History of angioplasty |  |  | integer |
| hxofCABG | History of coronary artery bypass surgery |  |  | integer |
| any.event | Death, newMI, newPTCA, or newCABG |  |  | integer |
| ecg | Baseline electrocardiogram diagnosis |  | 3 | integer |
| Variable | Levels |
|---|---|
| gender | male |
|  | female |
| hxofCig | heavy |
|  | moderate |
|  | non-smoker |
| ecg | normal |
|  | equivocal |
|  | MI |
You can write the text output of contents into a text file in your current working directory, and click on that file in the RStudio Files window to create a new tab in the editor panel where you can view the data dictionary at any time. This is especially helpful if you need a reminder of variable definitions that are stored in the variable labels. Here is an example where the formatted data dictionary is saved.
If you have the xless system command installed you can pop up a contents window at any time by typing xless(contents(d)) in the console. xless is in Hmisc.

capture.output(contents(d), file='contents.txt')

Or put the html version of the data dictionary into a small browser window to which you can refer at any point in analysis coding.
cat(html(contents(d)), file='contents.html')
browseURL('contents.html', browser='vivaldi -new-window')

RStudio provides a nice way to do this, facilitated by the htmlView helper function in reptools. htmlView takes any number of objects for which an html method exists to render them. They are rendered in the RStudio Viewer pane. If you are running outside RStudio, your default browser will be launched instead.
At times the RStudio Viewer will drop its arrow button, making it impossible to navigate back and forth to different html outputs. Code for htmlView and htmlViewx may be viewed in reptools.r.

getRs('reptools.r', put='source')
# reptools.r defines htmlView, htmlViewx, kabl, maketabs, dataChk, ...
htmlView(contents(d))

In some cases it is best to have a browser window open to the full descriptive statistics for a data table/frame (see below; the describe function also shows labels, units, and levels).
To be able to have multiple windows open to see information about datasets it is advantageous to open an external browser window. The htmlViewx function will by default open a Vivaldi browser window with the first output put in a new window and all subsequent objects displayed as tabs within that same window. This behavior can be controlled with the tab argument, and you can use a different browser by issuing for example options(vbrowser='firefox'). As an example suppose that two datasets were read from the hbiostat.org/data data repository, and the data dictionary and descriptive statistics for both datasets were to be converted to html and placed in an external browser window for the duration of the R session.
firefox.exe. The tab names will not be correct until Hmisc 4.7-1 appears.

getHdata(support)
getHdata(support2)
htmlViewx(contents(support ), describe(support ),
contents(support2), describe(support2))

A screenshot of the result is here.
4 Missing Data
It is extremely important to understand the extent and patterns of missing data, starting with charting the marginal fraction of observations with NAs for each variable. The occurrence of simultaneous missings on multiple variables makes multiple imputation and analysis more difficult, so it is important to correlate and quantify missingness in variables multiple ways. The Hmisc package naclus, naplot, and combplotp functions provide a number of graphics along these lines.
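A minimal sketch of using two of these functions, assuming a data frame d containing NAs:

```r
require(Hmisc)
cl <- naclus(d)   # cluster variables on similarity of their missingness patterns
plot(cl)          # dendrogram: which variables tend to be missing together
naplot(cl, which='na per var')   # dot chart of the fraction of NAs per variable
```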
The missChk function in reptools uses these functions and others to produce a fairly comprehensive missingness report, placing each output in its own tab. When the number of variables containing any NA is small, the Hmisc na.pattern function’s simple output is by default all that is shown, and only one sentence is produced if there are no variables with NAs. Here is an example using again the 1000-patient support dataset on hbiostat.org/data, retrieved with the Hmisc function getHdata. Variables with no missing values are excluded from the report (except for being used in the predictive model at the end) to save space. The chart in the next-to-last tab is interactive. We also use the prednmiss option to run an ordinal logistic regression model to predict the number of missing variables from the values of all the non-missing variables, omitting the predictor dzclass because it is redundant with the variable dzgroup. The results of this analysis are in the last tab.
getRs('reptools.r', put='source') # Define dataChk, missChk, maketabs
require(data.table)
getHdata(support)
# Make it into a data table
setDT(support)
# Remove one variable we'll not be using
support[, adlsc := NULL]
missChk(support, prednmiss=TRUE, omitpred = ~ dzclass,
            baselabel='misscheck')

15 variables have no NAs and 19 variables have NAs
| adlp | urine | alb | pafi | income | adls | bili | totcst | sfdm2 | wblc |
|---|---|---|---|---|---|---|---|---|---|
| 634 | 192 | 70 | 34 | 16 | 11 | 6 | 2 | 2 | 1 |
rms::lrm(formula = as.formula(form), data = d)

Frequencies of Responses

 0   1   2   3   4   5   6   7   8   9  10  11  12  13
32 100  96 114 134 130 125  90  82  44  34   8   9   2
|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 1000 | LR χ² 197.64 | R² 0.181 | C 0.660 |
| max \|∂log L/∂β\| 2×10⁻⁸ | d.f. 20 | R²(20,1000) 0.163 | Dxy 0.319 |
|  | Pr(>χ²) <0.0001 | R²(20,988.6) 0.164 | γ 0.319 |
|  |  | Brier 0.177 | τa 0.287 |
The likelihood ratio \(\chi^2\) test in the last tab is a test of whether any of a subject’s non-missing variable values are associated with the number of missing variables on the subject. It shows strong evidence for such associations. From the dot chart we see that the strongest predictors of missing baseline variables are time to death/censoring and disease group. This may be due to patients on ventilators not being able to provide as much baseline information such as activities of daily living (adlp), and being on a ventilator is a strong prognostic sign. There is a possible sex effect worth investigating.
5 Data Checking
Besides useful descriptive statistics exemplified below, it is important to flag suspicious values in an automated way. Since checking multiple columns may require running a large number of R expressions to classify observations as suspicious, let’s automate the process somewhat by specifying a vector of expressions. Then we have R “compute on the language” to parse the expressions for finding observations to flag, and for printing. This is done by the dataChk function in reptools on Github.
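The essential idea—evaluating each check expression inside the data frame and reporting the flagged rows—can be sketched in base R (toy data; dataChk adds tabs, ID handling, and formatting):

```r
d <- data.frame(id = 1:4, age = c(25, 50, 95, 60))
checks <- expression(age < 30, age > 90)
for(ch in checks) {
  i <- which(eval(ch, d))   # evaluate the expression using d's columns
  cat(deparse(ch), ': flagged ids', paste(d$id[i], collapse=', '), '\n')
}
```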
The following code results in separate output for each individual data check, in separate Quarto tabs. The dataset does not have a subject ID variable so let’s create one, and also add a site variable to print. Arguments are specified to dataChk so that no tab is produced for a condition that never occurred in the data, and a tab is produced showing all data flags, sorted by id and site.
w <- d
setDT(w)
w[, id := 1 : .N]
set.seed(1)
w[, site := sample(LETTERS[1:6], .N, replace=TRUE)]
checks <- expression(
age < 30 | age > 90,
gender == 'female' & maxhr > 170,
baseEF %between% c(72, 77),
baseEF > 77,
baseEF > 99,
sbp > 250 & maxhr < 160)
dataChk(w, checks, id=c('id', 'site'),
omit0=TRUE, byid=TRUE, html=TRUE)id site age 1: 14 B 29 2: 23 F 91 3: 30 D 26 4: 60 A 91 5: 64 D 92 6: 116 A 28 7: 235 F 91 8: 259 D 29 9: 313 C 93
id site gender maxhr 1: 11 A female 171 2: 89 B female 182 3: 412 E female 200
id site baseEF 1: 56 B 72 2: 200 A 75 3: 272 A 72 4: 366 E 74 5: 406 B 77 6: 433 A 74 7: 495 D 72 8: 496 E 74
id site baseEF 1: 299 D 79 2: 434 E 83
id site sbp maxhr 1: 51 B 309 146 2: 146 E 283 135 3: 353 D 274 117
id site Check Values
1: 11 A gender == "female" & maxhr > 170 female 171
2: 14 B age < 30 | age > 90 29
3: 23 F age < 30 | age > 90 91
4: 30 D age < 30 | age > 90 26
5: 51 B sbp > 250 & maxhr < 160 309 146
6: 56 B baseEF [72, 77] 72
7: 60 A age < 30 | age > 90 91
8: 64 D age < 30 | age > 90 92
9: 89 B gender == "female" & maxhr > 170 female 182
10: 116 A age < 30 | age > 90 28
11: 146 E sbp > 250 & maxhr < 160 283 135
12: 200 A baseEF [72, 77] 75
13: 235 F age < 30 | age > 90 91
14: 259 D age < 30 | age > 90 29
15: 272 A baseEF [72, 77] 72
16: 299 D baseEF > 77 79
17: 313 C age < 30 | age > 90 93
18: 353 D sbp > 250 & maxhr < 160 274 117
19: 366 E baseEF [72, 77] 74
20: 406 B baseEF [72, 77] 77
21: 412 E gender == "female" & maxhr > 170 female 200
22: 433 A baseEF [72, 77] 74
23: 434 E baseEF > 77 83
24: 495 D baseEF [72, 77] 72
25: 496 E baseEF [72, 77] 74
id site Check Values
                             Check n
1              age < 30 | age > 90 9
2 gender == "female" & maxhr > 170 3
3                  baseEF [72, 77] 8
4                      baseEF > 77 2
5                      baseEF > 99 0
6          sbp > 250 & maxhr < 160 3
6 Data Overview
6.1 Filtering of Observations
When the number of subjects in the final analysis is less than the number of subjects initially entering the study, it is important to detail the observation filtering process. This is often done with a consort diagram, and fortunately R has the consort package by Alim Dayim for drawing data-driven diagrams. To demonstrate its use let’s simulate a two-treatment randomized clinical trial in which 1000 subjects were screened for participation.
N <- 1000
set.seed(1)
r <- data.table(id = 1 : N)
r[, .q(id, age, pain, hxmed) :=
.(1 : N,
round(rnorm(N, 60, 15)),
sample(0 : 5, N, replace=TRUE),
sample(0 : 1, N, replace=TRUE, prob=c(0.95, 0.05)) ) ]
# Set consent status to those not excluded at screening
r[age >= 40 & pain > 0 & hxmed == 0,
consent := sample(0 : 1, .N, replace=TRUE, prob=c(0.1, 0.9))]
# Set randomization status for those consenting
r[consent == 1,
randomized := sample(0 : 1, .N, replace=TRUE, prob=c(0.15, 0.85))]
# Add treatment and follow-up time to randomized subjects
r[randomized == 1, tx := sample(c('A', 'B'), .N, replace=TRUE)]
r[randomized == 1, futime := pmin(runif(.N, 0, 10), 3)]
# Add outcome status for those followed 3 years
# Make a few of those followed 3 years missing
r[futime == 3,
y := sample(c(0, 1, NA), .N, replace=TRUE, prob=c(0.75, 0.2, 0.05))]
# Print first 15 subjects
kabl(r[1 : 15, ])

| id | age | pain | hxmed | consent | randomized | tx | futime | y |
|---|---|---|---|---|---|---|---|---|
| 1 | 51 | 2 | 1 | NA | NA | NA | NA | NA |
| 2 | 63 | 2 | 0 | 1 | 1 | A | 3.0000 | 0 |
| 3 | 47 | 1 | 0 | 1 | 1 | A | 3.0000 | NA |
| 4 | 84 | 5 | 0 | 1 | 0 | NA | NA | NA |
| 5 | 65 | 3 | 0 | 1 | 1 | B | 3.0000 | 1 |
| 6 | 48 | 4 | 0 | 1 | 1 | A | 3.0000 | 0 |
| 7 | 67 | 3 | 0 | 1 | 1 | B | 2.0566 | NA |
| 8 | 71 | 0 | 0 | NA | NA | NA | NA | NA |
| 9 | 69 | 5 | 0 | 1 | 1 | B | 1.2815 | NA |
| 10 | 55 | 2 | 0 | 1 | 1 | B | 1.2388 | NA |
| 11 | 83 | 4 | 0 | 1 | 1 | A | 3.0000 | 1 |
| 12 | 66 | 5 | 0 | 1 | 1 | A | 3.0000 | 0 |
| 13 | 51 | 1 | 0 | 1 | 1 | B | 3.0000 | 0 |
| 14 | 27 | 2 | 0 | NA | NA | NA | NA | NA |
| 15 | 77 | 2 | 0 | 1 | 1 | A | 3.0000 | 0 |
Now show the flow of qualifications and exclusions in a consort diagram. consort_plot wants multiple reasons for exclusions to be prioritized in a hierarchy, and NA to be used to denote “no exclusion”. Use the reptools function seqFreq that creates a factor variable whose first level is the most common exclusion, second level is the most common exclusion after excluding subjects based on the first exclusion, and so on. seqFreq also returns an attribute obs.per.numcond with a frequency tabulation of the number of observations having a given number of conditions.
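To see the idea behind seqFreq, here is a conceptual base-R sketch; the real function also handles unnamed conditions, NAs, and the obs.per.numcond attribute.

```r
# Pick the most frequent condition, remove the rows it flags, repeat
seqfreq_sketch <- function(conds) {   # conds: named list of logical vectors
  remaining <- rep(TRUE, length(conds[[1]]))
  out <- character(0)
  while(length(conds)) {
    n <- sapply(conds, function(x) sum(x & remaining))
    if(max(n) == 0) break
    top <- names(which.max(n))
    out <- c(out, paste0(top, ' (n=', max(n), ')'))
    remaining <- remaining & ! conds[[top]]
    conds[[top]] <- NULL
  }
  out
}
```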
Had we wanted a non-data-dependent hierarchy we could have used
r[, exc := factor(fcase(pain == 0, 1,
hxmed == 1, 2,
age < 40 , 3,
default=NA),
1:3, c('pain-free', 'Hx medication', 'age < 40'))]

addCap()
r[, exc := seqFreq('pain-free' = pain == 0,
'Hx med' = hxmed == 1,
age < 40,
noneNA=TRUE)]
eo <- attr(r[, exc], 'obs.per.numcond')
mult <- paste0('1, 2, ≥3 exclusions: n=',
eo[2], ', ',
eo[3], ', ',
eo[-(1:3)] )
r[, .q(qual, consent, fin) :=
.(is.na(exc),
ifelse(consent == 1, 1, NA),
ifelse(futime >= 3, 1, NA))]
require(consort)
consort_plot(r,
orders = c(id = 'Screened',
exc = 'Excluded',
qual = 'Qualified for Randomization',
consent = 'Consented',
tx = 'Randomized',
fin = 'Finished',
y = 'Outcome\nassessed'),
side_box = 'exc',
allocation = 'tx',
labels=c('1'='Screening', '3'='Consent', '4'='Randomization', '6'='Follow-up'),
coords=c(0.4, 0.6), cex=0.65,
)consort allows provides another way to build the diagram that may be more flexible and intuitive after we define some helper functions. The built-in R pipe operator |> is used.
addCap()
h <- function(n, label) paste0(label, ' (n=', n, ')')
htab <- function(x, label=NULL, split=! length(label), br='\n') {
tab <- table(x)
w <- if(length(label)) paste0(h(sum(tab), label), ':', br)
f <- if(split) h(tab, names(tab))
else
paste(paste0(' ', h(tab, names(tab))), collapse=br)
if(split) return(f)
paste(w, f, sep=if(length(label))'' else br)
}
count <- function(x, by=rep(1, length(x)))
tapply(x, by, sum, na.rm=TRUE)
w <- r[, {
g <-
add_box(txt=h(nrow(r), 'Screened')) |>
add_side_box(htab(exc, 'Excluded')) |>
add_box(h(count(is.na(exc)), 'Qualified for Randomization')) |>
add_box(h(count(consent), 'Consented')) |>
add_box(h(count(randomized), 'Randomized')) |>
add_split(htab(tx)) |>
add_box(h(count(fin, tx), 'Finished')) |>
add_box(h(count(! is.na(y), tx), 'Outcome\nassessed')) |>
add_label_box(c('1'='Screening', '3'='Consent',
'4'='Randomization', '6'='Follow-up'))
plot(g)
}
]

The mermaid natural diagramming language can be used to make consort-like flowcharts among many other types of diagrams. Quarto has built-in support for mermaid, and a reptools function makemermaid uses the capabilities of the knit_expand function in the knitr package to allow variable values to easily be inserted into mermaid specifications using the {{variablename}} notation. Whole diagram elements can be inserted by having {{x}} in a separate mermaid input line, where x is a character string that R constructs. results='asis' needs to be in the chunk header to use makemermaid, and any node labels that contain parentheses must be enclosed in quotes. In the first example, exclusions are in one large node and a special output class (used to limited effect here) is used for this node.
cap <- 'Consort diagram produced by `mermaid`'
addCap('fig-mermaid1', cap)
# TD = top down
x <- '%%| fig-cap: "{{cap}}"
%%| label: fig-mermaid1
flowchart TD
S["Screened (n={{N0}})"] --> E["{{excl}}"]
S --> Q["Qualified for Randomization (n={{Nq}})"]
Q --> C["Consented (n={{Nc}})"]
C --> R["Randomized (n={{Nr}})"]
R --> TxA["A (n={{Ntxa}})"]
R --> TxB["B (n={{Ntxb}})"]
TxA --> FA["Finished (n={{Ntxaf}})"]
TxB --> FB["Finished (n={{Ntxbf}})"]
FA --> OA["Outcome assessed (n={{Ntxao}})"]
FB --> OB["Outcome assessed (n={{Ntxbo}})"]
classDef largert fill:lightgray,width:1.5in,height:10em,text-align:right,font-size:0.8em
class E largert;
'
w <- r[,
makemermaid(x,
cap = cap,
N0 = nrow(r),
excl = htab(exc, 'Excluded', br='<br>'),
Nq = count(is.na(exc)),
Nc = count(consent),
Nr = count(randomized),
Ntxa = count(tx == 'A'),
Ntxb = count(tx == 'B'),
Ntxaf= count(tx == 'A' & fin),
Ntxbf= count(tx == 'B' & fin),
Ntxao= count(tx == 'A' & ! is.na(y)),
Ntxbo= count(tx == 'B' & ! is.na(y))
)
]
flowchart TD
S["Screened (n=1000)"] --> E["Excluded (n=286):<br>pain-free (n=156)<br>age < 40 (n=85)<br>Hx med (n=45)"]
S --> Q["Qualified for Randomization (n=714)"]
Q --> C["Consented (n=634)"]
C --> R["Randomized (n=534)"]
R --> TxA["A (n=285)"]
R --> TxB["B (n=249)"]
TxA --> FA["Finished (n=204)"]
TxB --> FB["Finished (n=175)"]
FA --> OA["Outcome assessed (n=196)"]
FB --> OB["Outcome assessed (n=165)"]
classDef largert fill:lightgray,width:1.5in,height:10em,text-align:right,font-size:0.8em
class E largert;
In the second mermaid example, each exclusion is a subnode to the overall count of exclusions, and a new node is added to inform us about multiple exclusions per subject. To provide fuller information about multiple exclusions, the entire frequency distribution of the number of exclusions per subject appears as a hover text tooltip (thus the new node is not really needed). Let’s also remove the need for more error-prone manual coding of parallel treatment groups by creating a function parNodes to do this.
To make mermaid tooltips work the following needs to be placed in your script.
<style>
.mermaidTooltip {
position: absolute;
text-align: center;
max-width: 200px;
padding: 2px;
font-family: 'trebuchet ms', verdana, arial;
font-size: 12px;
background: #ffffde;
border: 1px solid #aaaa33;
border-radius: 2px;
pointer-events: none;
z-index: 100;
}
</style>
exclnodes <- function(x, from='E', root='E', seq=FALSE, remain=FALSE) {
# Create complete node specifications for individual exclusions, each
# linking to overall exclusion count assumed to be in node root.
# Set seq=TRUE to make use of the fact that the exclusions were
# done in frequency priority order so that each exclusion is in
# addition to the previous one. Leave seq=FALSE to make all exclusions
# subservient to root. Use remain=TRUE to include # obs remaining
# remain=TRUE assumes noneNA specified to seqFreq
tab <- table(x)
i <- 1 : length(tab)
rem <- if(remain) paste0(', ', length(x) - cumsum(tab), ' remain')
w <- paste0('["', names(tab), ' (n=', tab, rem, ')"]')
u <- if(seq) ifelse(i == 1, paste0(from, ' --> ', root, '1', w),
paste0(root, i-1, '--> ', root, i, w))
else paste0(from, ' --> ', root, i, w)
paste(u, collapse='\n')
}
# Create parallel treatment nodes
# Treatments are assumed to be in order by the tx variable
# and will appear left to right in the diagram
# Treatment node names correspond to that and are Tx1, Tx2, ...
# root: root of new nodes, from: single node name to connect from
# fromparallel: root of connected-from node name which is to be
# expanded by adding the integers 1, 2, ... number of treatments.
Txs <- r[, if(is.factor(tx)) levels(tx) else sort(unique(tx))]
parNodes <- function(counts, root, from=NULL, fromparallel=NULL,
label=Txs) {
if(! identical(names(counts), Txs)) stop('Txs not consistent')
k <- length(Txs)
ns <- paste0(' (n=', counts, ')')
nodes <- paste0(root, 1 : k, '["', label, ns, '"]')
w <- if(length(fromparallel))
paste0(fromparallel, 1 : k, ' --> ', nodes)
else paste0(from, ' --> ', paste(nodes, collapse=' & '))
paste(w, collapse='\n')
}
# Create tooltip text from tabulation created by seqFreq earlier
efreq <- data.frame('# Exclusions'= (1 : length(eo)) - 1,
'# Subjects' = eo, check.names=FALSE)
efreq <- subset(efreq, `# Subjects` > 0)
# Convert to text which will be wrapped by the html
# verbatim text tags <pre>...</pre>
# This preserves spacing and line breaks
excltab <- paste(capture.output(print(efreq, row.names=FALSE)),
                 collapse='\n')

cap <- 'Consort diagram produced with `mermaid` with individual exclusions linked to the overall exclusions node, and with a tooltip to show more detail'
addCap('fig-mermaid2', cap)
x <- '%%| fig-cap: "{{cap}}"
%%| label: fig-mermaid2
flowchart TD
S["Screened (n={{N0}})"] --> E["Excluded (n={{Ne}})"]
{{exclsep}}
E1 & E2 & E3 --> M["{{mult}}"]
S --> Q["Qualified for Randomization (n={{Nq}})"]
Q --> C["Consented (n={{Nc}})"]
C --> R["Randomized (n={{Nr}})"]
{{txcounts}}
{{finished}}
{{outcome}}
click E callback "<pre>{{excltab}}</pre>"
'
w <- r[,
makemermaid(x,
cap = cap,
N0 = nrow(r),
Ne = count(! is.na(exc)),
exclsep = exclnodes(exc), # add seq=TRUE to put exclusions vertical
excltab = excltab, # tooltip text
mult = mult, # separate node: count multiple exclusions
Nq = count(is.na(exc)),
Nc = count(consent),
Nr = count(randomized),
txcounts = parNodes(table(tx), 'Tx', from='R'),
finished = parNodes(count(fin, by=tx), 'F', fromparallel='Tx',
label='Finished'),
outcome = parNodes(count(! is.na(y), by=tx), 'O',
fromparallel='F', label='Outcome assessed'),
file='mermaid2.md' # save generated code for another use
)
]
flowchart TD
S["Screened (n=1000)"] --> E["Excluded (n=286)"]
E --> E1["pain-free (n=156)"]
E --> E2["age < 40 (n=85)"]
E --> E3["Hx med (n=45)"]
E1 & E2 & E3 --> M["1, 2, ≥3 exclusions: n=260, 25, 1"]
S --> Q["Qualified for Randomization (n=714)"]
Q --> C["Consented (n=634)"]
C --> R["Randomized (n=534)"]
R --> Tx1["A (n=285)"] & Tx2["B (n=249)"]
Tx1 --> F1["Finished (n=204)"]
Tx2 --> F2["Finished (n=175)"]
F1 --> O1["Outcome assessed (n=196)"]
F2 --> O2["Outcome assessed (n=165)"]
click E callback " # Exclusions # Subjects
0 714
1 260
2 25
3 1"
Hover the mouse over Excluded to see details
Use a similar approach for displaying missing data exclusions in support with respect to selected variables. The reptools missChk function will automatically start with the variable having the highest number of NAs, then go to the variable that is most missing after removing observations that are missing on the first variable, etc.
cap <- 'Flowchart of sequential exclusion of observations due to missing values'
addCap('fig-missflow', cap)
vars <- .q(age, sex, dzgroup, edu, income, meanbp, wblc,
alb, bili, crea, glucose, bun, urine)
ex <- missChk(support, use=vars, type='seq') # seq: don't make report
# Create tooltip text from tabulation created by seqFreq
oc <- attr(ex, 'obs.per.numcond')
freq <- data.frame('# Exclusions'= (1 : length(oc)) - 1,
'# Subjects' = oc, check.names=FALSE)
freq <- subset(freq, `# Subjects` > 0)
excltab <- paste(capture.output(print(freq, row.names=FALSE)),
collapse='\n')
x <- '%%| label: fig-missflow
%%| fig-cap: "{{cap}}"
flowchart TD
Enr["Enrolled (n={{N0}})"]
{{exclsep}}
click E1 callback "<pre>{{excltab}}</pre>"
'
makemermaid(x,
cap = cap,
N0 = nrow(support),
exclsep = exclnodes(ex, from='Enr', seq=TRUE, remain=TRUE),
excltab = excltab
)
flowchart TD
Enr["Enrolled (n=1000)"]
Enr --> E1["urine (n=517, 483 remain)"]
E1--> E2["alb (n=197, 286 remain)"]
E2--> E3["income (n=83, 203 remain)"]
E3--> E4["bili (n=23, 180 remain)"]
E4--> E5["edu (n=8, 172 remain)"]
E5--> E6["wblc (n=3, 169 remain)"]
click E1 callback " # Exclusions # Subjects
0 169
1 147
2 122
3 233
4 131
5 134
6 40
7 22
8 1
9 1"
Hover over the second box to see more details.
6.2 Analyzing Data About the Data
The contents function displays the data dictionary and number of missing values for each variable. We can go deeper in summarizing a dataset as a whole. The reptools dataOverview function first produces a brief overview of the dataset with regard to number of variables, observations, and unique subject identifiers (if present). Then it provides a table, and a graphical report with one graph (in one report tab) per type of variable: continuous, discrete, or non-discrete non-numeric. Variable characteristics summarized are the number of distinct values, number of NAs, information measure (see describe), symmetry, modal variable value (most frequent value), frequency of modal value, minimum frequency value, and frequency of that value. When there are tied frequencies the first value is reported. The default display is an interactive plotly scatterplot with the number of distinct values on the x-axis and symmetry on the y-axis. The other data characteristics are shown as hover text. The points are color-coded for intervals of the number of NAs.
Info is the information in a variable relative to the information (1.0) in a variable having no tied values. Symmetry is defined as follows. For non-numeric variables, symmetry is one minus the mean absolute difference between relative category frequencies and the reciprocal of the number of categories. It has a value of 1.0 when the frequencies of all the categories are the same. For continuous variables symmetry is the ratio of the 0.95 quantile minus the mean to the mean minus the 0.05 quantile. This symmetry measure is < 1.0 when the left tail is heavy and > 1 when the right tail of the distribution is heavy.Again use the support dataset.
dataOverview(support, baselabel='dataov')support has 1000 observations (32 complete) and 34 variables (15 complete)
| Minimum | Maximum | Mean | |
|---|---|---|---|
| Per variable | 0 | 634 | 141.6 |
| Per observation | 0 | 13 | 4.8 |
| 0 | 3 | 5 | 6 | 24 | 25 | 105 | 159 | 202 | 250 | 253 | 297 | 310 | 349 | 372 | 378 | 455 | 470 | 517 | 634 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 100 | 96 | 114 | 134 | 130 | 125 | 90 | 82 | 44 | 34 | 8 | 9 | 2 |
| Variable | Type | Distinct | Info | Symmetry | NAs | Rarest Value | Frequency of Rarest Value | Mode | Frequency of Mode |
|---|---|---|---|---|---|---|---|---|---|
| age | Continuous | 970 | 1.000 | 0.756 | 0 | 18.042 | 1 | 67.102 | 3 |
| death | Discrete | 2 | 0.665 | 0.832 | 0 | 0 | 332 | 1 | 668 |
| sex | Discrete | 2 | 0.738 | 0.938 | 0 | female | 438 | male | 562 |
| hospdead | Discrete | 2 | 0.567 | 0.753 | 0 | 1 | 253 | 0 | 747 |
| slos | Continuous | 88 | 0.998 | 3.783 | 0 | 41 | 1 | 5 | 78 |
| d.time | Continuous | 582 | 1.000 | 2.988 | 0 | 39 | 1 | 4 | 25 |
| dzgroup | Discrete | 8 | 0.934 | 0.929 | 0 | Colon Cancer | 49 | ARF/MOSF w/Sepsis | 391 |
| dzclass | Discrete | 4 | 0.857 | 0.855 | 0 | Coma | 60 | ARF/MOSF | 477 |
| num.co | Discrete | 8 | 0.937 | 0.904 | 0 | 7 | 1 | 1 | 337 |
| edu | Continuous | 26 | 0.969 | 0.929 | 202 | 1 | 1 | 12 | 241 |
| income | Discrete | 5 | 0.872 | 0.888 | 349 | >$50k | 75 | under $11k | 309 |
| scoma | Continuous | 11 | 0.650 | 7.516 | 0 | 61 | 6 | 0 | 704 |
| charges | Continuous | 968 | 1.000 | 4.328 | 25 | 1635.75 | 1 | 4335 | 2 |
| totcst | Continuous | 896 | 1.000 | 4.032 | 105 | 0 | 1 | 0 | 1 |
| totmcst | Continuous | 618 | 1.000 | 3.656 | 372 | 829.63 | 1 | 0 | 12 |
| avtisst | Continuous | 242 | 1.000 | 1.651 | 6 | 1.667 | 1 | 13 | 38 |
| race | Discrete | 6 | 0.512 | 0.766 | 5 | asian | 9 | white | 781 |
| meanbp | Continuous | 122 | 1.000 | 1.179 | 0 | 20 | 1 | 73 | 33 |
| wblc | Continuous | 283 | 1.000 | 1.878 | 24 | 0.07 | 1 | 6.3 | 16 |
| hrt | Continuous | 124 | 1.000 | 1.297 | 0 | 11 | 1 | 80 | 48 |
| resp | Continuous | 45 | 0.993 | 1.195 | 0 | 7 | 1 | 20 | 147 |
| temp | Continuous | 64 | 0.999 | 1.262 | 0 | 32.5 | 1 | 36.398 | 57 |
| pafi | Continuous | 464 | 1.000 | 1.446 | 253 | 34 | 1 | 240 | 8 |
| alb | Continuous | 39 | 0.998 | 1.202 | 378 | 4.399 | 1 | 2.6 | 37 |
| bili | Continuous | 116 | 0.997 | 6.559 | 297 | 2.9 | 1 | 0.6 | 65 |
| crea | Continuous | 88 | 0.997 | 4.284 | 3 | 0.3 | 1 | 0.9 | 79 |
| sod | Continuous | 42 | 0.997 | 1.246 | 0 | 120 | 1 | 135 | 86 |
| ph | Continuous | 54 | 0.998 | 0.710 | 250 | 6.96 | 1 | 7.399 | 49 |
| glucose | Continuous | 227 | 1.000 | 2.711 | 470 | 11 | 1 | 96 | 9 |
| bun | Continuous | 107 | 1.000 | 2.672 | 455 | 2 | 1 | 18 | 21 |
| urine | Continuous | 360 | 1.000 | 1.677 | 517 | 1 | 1 | 0 | 12 |
| adlp | Discrete | 9 | 0.842 | 0.880 | 634 | 7 | 4 | 0 | 194 |
| adls | Discrete | 9 | 0.899 | 0.903 | 310 | 7 | 30 | 0 | 313 |
| sfdm2 | Discrete | 6 | 0.874 | 0.844 | 159 | Coma or Intub | 7 | <2 mo. follow-up | 338 |
|
|
7 Descriptive Statistics
The Hmisc describe function is my main tool for getting initial descriptive statistics and quality controlling the data in a univariate fashion. Here is an example. The Info index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of Info will occur when a highly imbalanced variable is binary. Clicking on Glossary on the right will pop up a browser window with a more in-depth glossary of terms used in Hmisc package output. It links to hbiostat.org/R/glossary.html which you can link from your reports that use Hmisc.
Info comes from the approximate formula for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead.Glossary
Gmd in the output stands for Gini’s mean difference—the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.
describe Output
# The callout was typed manually; could have run
# makecnote(~ html(describe(d)), wide=TRUE)
w <- describe(d)
html(w)31 Variables 558 Observations
bhr: Basal heart rate bpm
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 69 | 0.999 | 75.29 | 16.57 | 54.0 | 58.0 | 64.0 | 74.0 | 84.0 | 95.3 | 102.0 |
basebp: Basal blood pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 94 | 0.998 | 135.3 | 23.35 | 104.0 | 110.0 | 120.0 | 133.0 | 150.0 | 162.3 | 170.1 |
basedp: Basal Double Product bhr*basebp bpm*mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 441 | 1 | 10181 | 2813 | 6607 | 7200 | 8400 | 9792 | 11663 | 13610 | 14770 |
pkhr: Peak heart rate mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 105 | 1 | 120.6 | 25.36 | 81.85 | 90.70 | 106.25 | 122.00 | 135.00 | 147.00 | 155.15 |
sbp: Systolic blood pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 142 | 1 | 146.9 | 40.72 | 96 | 102 | 120 | 141 | 170 | 200 | 210 |
dp: Double product pkhr*sbp bpm*mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 508 | 1 | 17634 | 5765 | 10256 | 11341 | 14033 | 17060 | 20644 | 24536 | 26637 |
dose: Dose of dobutamine given mg
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 558 | 0 | 7 | 0.84 | 33.75 | 8.334 |
Value 10 15 20 25 30 35 40 Frequency 2 28 47 56 64 61 300 Proportion 0.004 0.050 0.084 0.100 0.115 0.109 0.538
maxhr: Maximum heart rate bpm
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 103 | 1 | 119.4 | 24.64 | 82.0 | 91.0 | 104.2 | 120.0 | 133.0 | 146.0 | 154.1 |
pctMphr: Percent maximum predicted heart rate achieved %
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 78 | 0.999 | 78.57 | 16.86 | 53 | 60 | 69 | 78 | 88 | 97 | 104 |
mbp: Maximum blood pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 132 | 0.999 | 156 | 35.03 | 110.0 | 120.0 | 133.2 | 150.0 | 175.8 | 200.0 | 211.1 |
dpmaxdo: Double product on max dobutamine dose bpm*mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 484 | 1 | 18550 | 5385 | 11346 | 12865 | 15260 | 18118 | 21239 | 24893 | 27477 |
dobdose: Dobutamine dose at max double product mg
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 558 | 0 | 8 | 0.941 | 30.24 | 10.55 |
Value 5 10 15 20 25 30 35 40 Frequency 7 7 55 73 71 78 62 205 Proportion 0.013 0.013 0.099 0.131 0.127 0.140 0.111 0.367
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
gender
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
baseEF: Baseline cardiac ejection fraction %
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 54 | 0.994 | 55.6 | 10.71 | 32 | 40 | 52 | 57 | 62 | 65 | 66 |
dobEF: Ejection fraction on dobutamine %
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 60 | 0.992 | 65.24 | 12.38 | 40.0 | 49.7 | 62.0 | 67.0 | 73.0 | 76.0 | 80.0 |
chestpain: Chest pain
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.64 | 172 | 0.3082 | 0.4272 |
restwma: Resting wall motion abnormality on echocardiogram
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.745 | 257 | 0.4606 | 0.4978 |
posSE: Positive stress echocardiogram
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.553 | 136 | 0.2437 | 0.3693 |
newMI: New myocardial infarction
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.143 | 28 | 0.05018 | 0.09549 |
newPTCA: Recent angioplasty
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.138 | 27 | 0.04839 | 0.09226 |
newCABG: Recent bypass surgery
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.167 | 33 | 0.05914 | 0.1115 |
death
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.123 | 24 | 0.04301 | 0.08247 |
hxofHT: History of hypertension
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.625 | 393 | 0.7043 | 0.4173 |
hxofDM: History of diabetes
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.699 | 206 | 0.3692 | 0.4666 |
hxofCig: History of smoking
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 3 |
Value heavy moderate non-smoker Frequency 122 138 298 Proportion 0.219 0.247 0.534
hxofMI: History of myocardial infarction
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.599 | 154 | 0.276 | 0.4004 |
hxofPTCA: History of angioplasty
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.204 | 41 | 0.07348 | 0.1364 |
hxofCABG: History of coronary artery bypass surgery
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.399 | 88 | 0.1577 | 0.2661 |
any.event: Death, newMI, newPTCA, or newCABG
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.402 | 89 | 0.1595 | 0.2686 |
ecg: Baseline electrocardiogram diagnosis
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 3 |
Value normal equivocal MI Frequency 311 176 71 Proportion 0.557 0.315 0.127
# To create a separate browser window:
cat(html(w), file='desc.html')
browseURL('desc.html', browser='firefox -new-window')Better, whether using RStudio or not:
htmlView(w, contents(d)) # or htmlView(describe(d1), describe(d2), ...)
# Use htmlViewx to use an external browser window (see above)There is also a plot method for describe output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use ggplot2 to produce static graphics. The result can be fed directly into maketabs described earlier. results='asis' must appear in the chunk header.
cap <- paste('Regular plot(describe) output for',
c('categorical', 'continuous'), 'variables')
maketabs(plot(w, bvspace=2.5), baselabel='pdesc', cap=cap)By specifying grType option you can instead get plotly graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.
options(grType='plotly')
cap <- paste('plotly plot(describe) output for', c('categorical', 'continuous'),
'variables')
maketabs(plot(w, bvspace=2.5), wide=TRUE, baselabel='pldesc', cap=cap)See this for other Hmisc functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The summaryM function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).
require(data.table)
setDT(d) # turn d into a data table
# tables() with no arguments will give a concise summary of all active data tables
w <- d
w[, hxofMI := factor(hxofMI, 0 : 1, c('No history of MI', 'History of MI'))]
vars <- setdiff(names(d), 'hxofMI')
form <- as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))
print(form)bhr + basebp + basedp + pkhr + sbp + dp + dose + maxhr + pctMphr + mbp + dpmaxdo + dobdose + age + gender + baseEF + dobEF + chestpain + restwma + posSE + newMI + newPTCA + newCABG + death + hxofHT + hxofDM + hxofCig + hxofPTCA + hxofCABG + any.event + ecg ~ hxofMI
s <- summaryM(form, data=d, test=TRUE)
maketabs(
` ` ~ ` `, # empty tab
`Table 1` ~ html(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE),
`Categorical Variable Plot` ~ plot(s, which='categorical', vars=1 : 4, height=600, width=1000),
`Continuous Variable Plot` ~ plot(s, which='continuous', vars=1 : 4),
wide=TRUE, baselabel='summarym', cap=c('', '', '-', '-'))| Descriptive Statistics (N=558). | |||
| No history of MI N=404 |
History of MI N=154 |
Test Statistic |
|
|---|---|---|---|
Basal heart rate bpm |
65 74 85 | 63 72 84 | F1 556=1.41, P=0.2351 |
Basal blood pressure mmHg |
120 134 150 | 120 130 150 | F1 556=1.39, P=0.2381 |
Basal Double Product bhr*basebp bpm*mmHg |
8514 9874 11766 | 8026 9548 11297 | F1 556=3.32, P=0.0691 |
Peak heart rate mmHg |
107 123 136 | 104 120 132 | F1 556=2.35, P=0.1261 |
Systolic blood pressure mmHg |
124 146 174 | 115 134 158 | F1 556=12.1, P<0.0011 |
Double product pkhr*sbp bpm*mmHg |
14520 17783 21116 | 13198 15539 18885 | F1 556=15, P<0.0011 |
Dose of dobutamine given mg : 10 |
0.00 2⁄404 | 0.00 0⁄154 | χ26=8.77, P=0.1872 |
| 15 | 0.05 21⁄404 | 0.05 7⁄154 | |
| 20 | 0.10 40⁄404 | 0.05 7⁄154 | |
| 25 | 0.11 45⁄404 | 0.07 11⁄154 | |
| 30 | 0.11 43⁄404 | 0.14 21⁄154 | |
| 35 | 0.11 45⁄404 | 0.10 16⁄154 | |
| 40 | 0.51 208⁄404 | 0.60 92⁄154 | |
Maximum heart rate bpm |
107 122 134 | 102 118 130 | F1 556=4.05, P=0.0451 |
Percent maximum predicted heart rate achieved % |
69.0 78.0 89.0 | 70.0 77.0 87.5 | F1 556=0.5, P=0.4791 |
Maximum blood pressure mmHg |
138 154 180 | 130 142 162 | F1 556=13, P<0.0011 |
Double product on max dobutamine dose bpm*mmHg |
15654 18666 21664 | 14489 16785 19680 | F1 556=16.1, P<0.0011 |
Dobutamine dose at max double product mg : 5 |
0.01 4⁄404 | 0.02 3⁄154 | χ27=8.5, P=0.292 |
| 10 | 0.01 6⁄404 | 0.01 1⁄154 | |
| 15 | 0.11 43⁄404 | 0.08 12⁄154 | |
| 20 | 0.14 58⁄404 | 0.10 15⁄154 | |
| 25 | 0.13 54⁄404 | 0.11 17⁄154 | |
| 30 | 0.13 51⁄404 | 0.18 27⁄154 | |
| 35 | 0.10 40⁄404 | 0.14 22⁄154 | |
| 40 | 0.37 148⁄404 | 0.37 57⁄154 | |
Age years |
59.0 68.0 75.0 | 63.2 71.0 76.8 | F1 556=9.75, P=0.0021 |
| gender : female | 0.68 273⁄404 | 0.42 65⁄154 | χ21=30, P<0.0012 |
Baseline cardiac ejection fraction % |
55 59 63 | 40 54 60 | F1 556=56.4, P<0.0011 |
Ejection fraction on dobutamine % |
65.0 70.0 74.2 | 50.0 64.5 70.0 | F1 556=50.3, P<0.0011 |
| Chest pain | 0.29 119⁄404 | 0.34 53⁄154 | χ21=1.29, P=0.2572 |
| Resting wall motion abnormality on echocardiogram | 0.57 230⁄404 | 0.18 27⁄154 | χ21=69.7, P<0.0012 |
| Positive stress echocardiogram | 0.21 86⁄404 | 0.32 50⁄154 | χ21=7.56, P=0.0062 |
| New myocardial infarction | 0.03 14⁄404 | 0.09 14⁄154 | χ21=7.4, P=0.0072 |
| Recent angioplasty | 0.02 10⁄404 | 0.11 17⁄154 | χ21=17.8, P<0.0012 |
| Recent bypass surgery | 0.05 21⁄404 | 0.08 12⁄154 | χ21=1.35, P=0.2462 |
| death | 0.04 15⁄404 | 0.06 9⁄154 | χ21=1.23, P=0.2672 |
| History of hypertension | 0.69 280⁄404 | 0.73 113⁄154 | χ21=0.89, P=0.3462 |
| History of diabetes | 0.36 147⁄404 | 0.38 59⁄154 | χ21=0.18, P=0.6742 |
| History of smoking : heavy | 0.21 83⁄404 | 0.25 39⁄154 | χ22=3.16, P=0.2062 |
| moderate | 0.24 96⁄404 | 0.27 42⁄154 | |
| non-smoker | 0.56 225⁄404 | 0.47 73⁄154 | |
| History of angioplasty | 0.04 15⁄404 | 0.17 26⁄154 | χ21=28.4, P<0.0012 |
| History of coronary artery bypass surgery | 0.08 34⁄404 | 0.35 54⁄154 | χ21=59.6, P<0.0012 |
| Death, newMI, newPTCA, or newCABG | 0.12 48⁄404 | 0.27 41⁄154 | χ21=18.1, P<0.0012 |
| Baseline electrocardiogram diagnosis : normal | 0.59 240⁄404 | 0.46 71⁄154 | χ22=8, P=0.0182 |
| equivocal | 0.29 117⁄404 | 0.38 59⁄154 | |
| MI | 0.12 47⁄404 | 0.16 24⁄154 | |
|
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. Tests used: 1Wilcoxon test; 2Pearson test . | |||
Semi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.
addCap()
d[, histboxp(x=maxhr, group=ecg, bins=200)]7.1 Longitudinal Continuous Y
For a continuous response variable measured longitudinally, two of the ways to display the data are
- “spaghetti plots” showing all data for all individual subjects, if the number of subjects is not very large
- finding clusters of individual subject curve characteristics and plotting a random sample of raw data curves within each cluster if the sample size is large
The Hmisc package curveRep function facilitates the latter approach using representative curves. It considers per-subject sample sizes and measurement time gaps as these will not only limit how we look at the data but may be informative, e.g., subjects who are failing may start to get more frequent measurements.
To demonstrate we simulate data in the following way.
- Simulate 200 subjects (curves) with per-curve sample sizes ranging from 1 to 10
- Make curves with odd-numbered IDs have a measurement time distribution that is random uniform [0,1] and those with even-numbered IDs have a time distribution that is half as wide but still centered at 0.5. Shift y values higher with increasing IDs
- Make 1/3 of subjects have a flat trajectory, 1/3 linear and not flat, and 1/3 quadratic
Request curveRep to cluster the 200 curves on the following characteristics:
kn=3sample size groups (curveRepactually used one more than this)kxdist=2time point distribution groups based here on the earliest and latest measurement times and the longest gap between any two measurements within a subjectk=4trajectory clusters. Trajectories are determined byloess-smoothing each subject’s curve and linearly interpolating the estimates to the same evenly-spaced grid of time points, then clustering on these 10 y-estimates. This captures intercepts, slopes, and shapes.
All subjects within each cluster are shown; we didn’t need to take random samples for 200 subjects. Results for the 4 per-curve sample size groups are placed in separate tabs.
set.seed(1)
N <- 200
nc <- sample(1:10, N, TRUE)
id <- rep(1:N, nc)
x <- y <- id
# Solve for coefficients of quadratic function such that it agrees with
# the linear function at x=0.25, 0.75 and is lower by -delta at x=0.5
delta <- -3
cof <- list(c(0, 0, 0), c(0, 10, 0), c(delta, 10, -2 * delta / (2 * 0.25^2)))
for(i in 1 : N) {
x[id==i] <- if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75))
shape <- sample(1:3, 1)
xc <- x[id == i] - 0.5
k <- cof[[shape]]
y[id == i] <- i/20 + k[1] + k[2] * xc + k[3] * xc ^ 2 +
runif(nc[i], -2.5, 2.5)
}
require(cluster)
w <- curveRep(x, y, id, kn=3, kxdist=2, k=4, p=10)
gg <- vector('list', 4)
nam <- rep('', 4)
for(i in 1 : 4) {
z <- plot(w, i, method='data') # method='data' available in Hmisc 4.7-1
z <- transform(z,
distribution = paste('Time Distribution:', distribution),
cluster = paste('Trajectory Cluster:', cluster))
gg[[i]] <-
if(i == 1)
ggplot(z, aes(x, y, color=curve)) + geom_point() +
facet_grid(distribution ~ cluster) +
theme(legend.position='none')
else
ggplot(z, aes(x, y, color=curve)) + geom_line() +
facet_grid(distribution ~ cluster) +
theme(legend.position='none')
nam[i] <- z$ninterval[1]
}
names(gg) <- nam
maketabs(gg, baselabel='curverep')7.2 Longitudinal Ordinal Y
Continuous longitudinal Y may be analyzed flexibly using semiparametric models while being described using representative curves as just discussed. Now suppose that Y is discrete. There are three primary ways of describing such data graphically.
- Show event occurrences/trajectories for a random sample of subjects (
Hmisc::multEventChart) - Show all transition proportions if time is discrete (
Hmisc::propsTrans) - Show all state occupancy proportions if time is discrete (
Hmisc::propsPO)
multEventChartStarting with a multiple event chart, simulate data on 5 patients, then display all the raw data.
addCap()
pt1 <- data.frame(pt=1, day=0:3,
status=.q(well, well, sick, 'very sick'))
pt2 <- data.frame(pt=2, day=c(1,2,4,6),
status=.q(sick, 'very sick', coma, death))
pt3 <- data.frame(pt=3, day=1:5,
status=.q(sick, 'very sick', sick, 'very sick', 'discharged'))
pt4 <- data.frame(pt=4, day=c(1:4, 10),
status=.q(well, sick, 'very sick', well, discharged))
d <- rbind(pt1, pt2, pt3, pt4)
d <- upData(d,
status = factor(status, .q(discharged, well, sick,
'very sick', coma, death)),
labels = c(day = 'Day'), print=FALSE)
kabl(pt1, pt2, pt3, pt4)
|
|
|
|
multEventChart(status ~ day + pt, data=d,
absorb=.q(death, discharged),
colorTitle='Status', sortbylast=TRUE) +
theme_classic() + theme(legend.position='bottom')For an example of displaying transition proportions, whereby all the outcome information in the raw data is shown, simulate some random data. The result is a plotly graphic over which you hover the pointer to see details. The size of the bubbles is proportional to the proportion in that transition.
addCap()
set.seed(1)
d <- expand.grid(id=1:30, time=1:7)
setDT(d) # convert to data.table
d[, sex := sample(c('female', 'male'), .N, replace=TRUE)]
d[, state := sample(LETTERS[1:4], .N, replace=TRUE)]
ggplotlyr(propsTrans(state ~ time + id, data=d))The final display doesn’t capture the within-subject correlation as done with the transition proportions, but is the most familiar display for longitudinal ordinal data as it shows proportions in the current states, which are cumulative incidence estimates for absorbing states.
propsPO to work properly, even though the real data file would terminate follow-up at an absorbing event.addCap()
ggplotlyr(propsPO(state ~ time + sex, data=d))7.3 Adverse Event Chart
When there is a potentially large number of event types, such as adverse events (AEs) in a clinical trial, and the event timing is not considered, a dot chart is an excellent way to present the proportion of subjects suffering each type of AE. The AEs can be sorted in descending order by the difference in proportions between treatments, and plotly hover text can display more details. Half-width confidence intervals are used (see Section 14.6). An AE chart is easily produced using the aePlot function in the qreport repository on github. aePlot expects the dataset to have one record per subject per AE, so the dataset itself does not define the proper denominator and this must be specified by the user (see denom below). The color coded needles in the right margin are guideposts to which denominators are being used in the analysis (details are here).
getRs('misc.r', grepo='qreport')
getRs('aePlot.r', grepo='qreport')
getHdata(aeTestData) # original data source: HH package
# One record per subject per adverse event
# For this example, the denominators for the two treatments in the
# pop-up needles will be incorrect because the dataset did not have
# subject IDs.
ae <- aeTestData
# qreport requires us to define official clinical trial counts and
# name of treatment variable
denom <- c(enrolled = 1000,
randomized = 400,
a=212, b=188)
setqreportOption(tx.var='treat', denom=denom)
aePlot(event ~ treat, data=ae, minincidence=.05, size='wide')| Category | N | Used |
|---|---|---|
| enrolled | 1000 | 400 |
| randomized | 400 | 400 |
| a | 212 | 212 |
| b | 188 | 188 |
7.4 Continuous Event Times
For time-to-event data with possibly multiple types of events, an event chart is a good way to show the raw outcome data for a sample of up to perhaps 40 subjects. The Hmisc package offers the event.chart function written by Jack Lee, Kenneth Hess, and Joel Dubin. Here is an example they provided. Patients are sorted by diagnosis date.
addCap()
getHdata(cdcaids)
event.chart(cdcaids,
subset.c=.q(infedate, diagdate, dethdate, censdate),
x.lab = 'Observation Dates',
y.lab='Patients',
titl='AIDS Data Calendar Event Chart',
point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
legend.plot=TRUE, legend.location='i', legend.cex=0.8,
legend.point.text=.q(transfusion,'AIDS diagnosis',death,censored),
legend.point.at = list(c(7210, 8100), c(35, 27)))7.5 Describing Variable Interrelationships
A good descriptive analysis to help understand relationships among variables and redundancies/collinearities is variable clustering. Here one clusters on variables instead of observations, and instead of a distance matrix we have a similarly matrix. A good default similarity matrix is based on the square of pairwise Spearman’s \(\rho\) rank correlation coefficients. In order to restrict ourselves to unsupervised learning, also called data reduction, we restrict attention to non-outcome variables. The vClus function in reptools runs the dataset, after excluding some variables, through the Hmisc dataframeReduce function to eliminate variables that are missing more than 0.4 of the time and to ignore character or factor variables having more than 10 levels. Binary variables having prevalence < 0.05 are dropped, and categorical variables having < 0.05 of their non-missing values in a category will have such low-frequency categories combined into an “other” category for purposes of variable clustering.[Normally one would omit the fracmiss, maxlevels, minprev arguments as the default values are reasonable
addCap()
outcomes <- .q(slos, charges, totcst, totmcst, avtisst,
d.time, death, hospdead, sfdm2)
vClus(support, exclude=outcomes, fracmiss=0.4, maxlevels=10, minprev=0.05)
Variables Removed or Modified
Variable Reason
1 dzgroup grouped categories
2 race grouped categories
3 glucose fraction missing>0.4
4 bun fraction missing>0.4
5 urine fraction missing>0.4
6 adlp fraction missing>0.4
The most strongly related variables are competing indicator variables for categories from the same variable, scoma vs. dzgroup “Coma”, and dzgroup “Cancer” vs. dzclass “Cancer”. The first type of relationship generates a strong negative correlation because if you’re in one category you can’t be in the other.
8 Data Manipulation and Aggregation
The data.table package provides a concise, consistent syntax for managing simple and complex data manipulation tasks, and it is extremely efficient for large datasets. One of the best organized tutorials is this, and a cheatsheet for data transformation is here. A master cheatsheet for data.table is here from which the general syntax below is taken, where DT represents a data table.
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
Data tables are also data frames so work on any method handling data frames. But data tables do not contain rownames.
Several data.table examples follow. I like to hold the current dataset in d to save typing. Some basic operations on data tables are:
d[2] # print 2nd row
d[2:4] # print rows 2,3,4
d[y > 2 & z > 3] # rows satisfying conditions
d[, age] # retrieve one variable
d[, .(age, gender)] # make a new table with two variables
i <- 2; d[, ..i] # get column 2
v <- 'age'; d[, ..v] # get variable named by contents of v
d[, (v)] # another way; these return data tables
d[.N] # last row
d[, .N, keyby=age] # number of rows for each age, sorted
d[1:10, bhr:pkhr] # first 10 rows, variables bhr - pkhr
d[1:10, !(bhr:pkhr)] # all but those variables
d[, 2:4] # get columns 2-4
d[, {...}] # run any number of lines of code in ...
# that reference variables inside ddata.table does many of its operations by reference to avoid the overhead of having multiple copies of data tables. This idea carries over to apparent copies of data tables. Here is an example.
a <- data.table(x1=1:3, x2=letters[1:3])
a x1 x2
1: 1 a
2: 2 b
3: 3 c
b <- a # no copy; b is just a pointer to a
b[, x2 := toupper(x2)] # changes a
a x1 x2
1: 1 A
2: 2 B
3: 3 C
a <- data.table(x1=1:3, x2=letters[1:3])
a2 <- copy(a) # fresh copy with its own memory space
a2[, x2 := toupper(x2)] # doesn't change a
a x1 x2
1: 1 a
2: 2 b
3: 3 c
8.1 Analyzing Selected Variables and Subsets
d <- stressEcho
d[, html(describe(age))]| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
d[, html(describe(~ age + gender))]2 Variables 558 Observations
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
gender
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
d[gender == 'female', html(describe(age))] # analyze age for females| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 338 | 0 | 58 | 0.999 | 67.01 | 13.74 | 47.00 | 50.70 | 59.25 | 68.00 | 76.00 | 83.00 | 85.00 |
html(describe(d[, .(age, gender)], 'Age and Gender Stats'))2 Variables 558 Observations
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
gender
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
# Separate analysis by female, male. Use keyby instead of by to sort the usual way.
d[, print(describe(age, descript=gender)), by=gender]male : Age [years]
n missing distinct Info Mean Gmd .05 .10
220 0 51 0.999 67.86 12.91 45.95 51.00
.25 .50 .75 .90 .95
62.00 69.00 75.00 81.00 86.00
lowest : 30 34 38 40 43, highest: 88 89 91 92 93
female : Age [years]
n missing distinct Info Mean Gmd .05 .10
338 0 58 0.999 67.01 13.74 47.00 50.70
.25 .50 .75 .90 .95
59.25 68.00 76.00 83.00 85.00
lowest : 26 28 29 33 34, highest: 87 88 89 90 91
Empty data.table (0 rows and 1 cols): gender
# Compute mean and median age by gender
d[, .(Mean=mean(age), Median=median(age)), by=gender] gender Mean Median
1: male 67.85909 69
2: female 67.00888 68
# To create a new subset
w <- d[gender == 'female' & age < 70, ]8.2 Adding or Changing Variables
With data.table you can create a new data table with added variables, or you can add or redefine variables in a data table in place. The latter has major speed and memory efficiency implications when processing massive data tables. Here d refers to a different data table from the one used above.
# Rename a variable
setnames(d, c('gender', 'height'),
c( 'sex', 'ht'))
# Easier:
setnames(d, .q(gender, height),
.q( sex, ht))
# Or
ren <- .q(gender=sex, height=ht)
setnames(d, names(ren), ren)
# Or
rename <- function(x, n) setnames(x, names(n), n)
rename(d, .q(gender=sex, height=ht))
# For all variables having baseline_ at the start of their names, remove it
n <- names(d)
setnames(d, n, sub('^baseline_', '', n)) # ^ = at start; use $ for at end
# For all variables having baseline_ at the start of their names, remove it
# and add Baseline to start of variable label
n <- names(d)
n <- n[grepl('^baseline_', n)]
ren <- sub('^baseline_', '', n); names(ren) <- n
# Fetch vector of labels for variables whose names start with baseline_
labs <- sapply(d, label)[n] # label() result is "" if no label
labs <- paste('Baseline', labs)
names(labs) <- ren   # upData expects labels named by the (new) variable names
d <- upData(d, rename=ren, labels=labs)
# Change all names to lower case
n <- names(d)
setnames(d, n, tolower(n))
# Abbreviate names of all variables longer than 10 characters
# abbreviate() will ensure that all names are unique
n <- names(d)
setnames(d, n, abbreviate(n, minlength=10))
# For any variables having [...] or (...) in their labels, assume these
# are units of measurement and move them from the label to the units
# attribute
d <- upData(d, moveUnits=TRUE)
# Add two new variables, storing the result in a new data table
z <- d[, .(bmi=wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
# Add one new variable in place
d[, bmi := wt / ht ^ 2]
# Add two new variables in place
d[, `:=`(bmi = wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
d[, .q(bmi, bsa) := .(wt / ht ^ 2, 0.016667 * sqrt(wt * ht))]
# Compute something requiring a different formula for different types
# of observations
d[, htAdj := ifelse(sex == 'male', ht, ht * 1.07)] # better: use fifelse in data.table
d[, htAdj := ht * ifelse(sex == 'male', 1, 1.07)]
d[, htAdj := (sex == 'male') * ht + (sex == 'female') * ht * 1.07]
d[sex == 'male', htAdj := ht]
d[sex == 'female', htAdj := ht * 1.07]
d[, htAdj := fcase(sex == 'male', ht, # fcase is in data.table
sex == 'female', ht * 1.07)]
d[, htAdj := fcase(sex == 'female', ht * 1.07, default = ht)]
# Add label & optional units (better to use upData which works on data tables)
adlab <- function(x, lab, un='') {
label(x) <- lab
if(un != '') units(x) <- un
x
}
d[, maxhr := adlab(maxhr, 'Maximum Heart Rate', '/m')]
# Delete a variable (or use upData)
d[, bsa := NULL]
# Delete two variables
d[, `:=`(bsa=NULL, bmi=NULL)]
d[, .q(bsa, bmi) := NULL]
8.3 Recoding Variables
# Group levels a1, a2 as a & b1,b2,b3 as b
d[, x := factor(x, .q(a1,a2,b1,b2,b3),
.q( a, a, b, b, b))]
# Regroup an existing factor variable
levels(d$x) <- list(a=.q(a1,a2), b=.q(b1,b2,b3))
# or
d <- upData(d, levels=list(x=list(a=.q(a1,a2), b=.q(b1,b2,b3))))
# Or manipulate character strings
d[, x := substring(x, 1, 1)] # replace x with first character of levels
# or
levels(d$x) <- substring(levels(d$x), 1, 1)
# Recode a numeric variable with values 0, 1, 2, 3, 4 to 0, 1, 1, 1, 2
d[, x := 1 * (x %in% 1:3) + 2 * (x == 4)]
d[, x := fcase(x %in% 1:3, 1, # fcase is in data.table
x == 4, 2)]
d[, x := fcase(x %between% c(1,3), 1,
x == 4, 2)] # %between% is in data.table
# Recode a series of conditions to a factor variable whose value is taken
# from the last condition that is TRUE using Hmisc::score.binary
# Result is a factor variable unless you add retfactor=FALSE
d[, x := score.binary(x1 == 'symptomatic',
x2 %in% .q(stroke, MI),
death)]
# Same but code with numeric points
d[, x := score.binary(x1 == 'symptomatic',
x2 %in% .q(stroke, MI),
death, # TRUE/FALSE or 1/0 variable
points=c(1,2,10))]
# Or just reverse the conditions and use fcase which stops at the first
# condition met
d[, x := fcase(death, 'death', # takes precedence
x2 %in% .q(stroke, MI), 'stroke/MI', # takes next precedence
x1 == 'symptomatic', 'symptomatic',
default = 'none')]
# Recode from one set of character strings to another using named vector
# A named vector can be subscripted with character strings as well as integers
states <- c(AL='Alabama', AK='Alaska', AZ='Arizona', ...)
# Could also do:
# states <- .q(AL=Alabama, AK=Alaska, AZ=Arizona, ..., NM='New Mexico', ...)
# or do a merge for table lookup (see later)
d[, State := states[state]]
# states are unique, state can have duplicates and all are recoded
d[, State := fcase(state == 'AL', 'Alabama', state == 'AK', 'Alaska', ...)]
# Recode from integers 1, 2, ..., to character strings
labs <- .q(elephant, giraffe, dog, cat)
d[, x := labs[x]]
# Recode from character strings to integers
d[, x := match(x, labs)]
d[, x := fcase(x == 'elephant', 1,
x == 'giraffe', 2,
x == 'dog', 3,
               x == 'cat',      4)]
As an example of more complex hierarchical recoding let's define codes in a nested list.
a <- list(plant =
list(vegetable = .q(spinach, lettuce, potato),
fruit = .q(apple, orange, banana, pear)),
animal =
list(domestic = .q(dog, cat, horse),
wild = .q(giraffe, elephant, lion, tiger)) )
a
$plant
$plant$vegetable
[1] "spinach" "lettuce" "potato"
$plant$fruit
[1] "apple" "orange" "banana" "pear"
$animal
$animal$domestic
[1] "dog" "cat" "horse"
$animal$wild
[1] "giraffe" "elephant" "lion" "tiger"
a <- unlist(a)
a
plant.vegetable1 plant.vegetable2 plant.vegetable3     plant.fruit1
"spinach" "lettuce" "potato" "apple"
plant.fruit2 plant.fruit3 plant.fruit4 animal.domestic1
"orange" "banana" "pear" "dog"
animal.domestic2 animal.domestic3 animal.wild1 animal.wild2
"cat" "horse" "giraffe" "elephant"
animal.wild3 animal.wild4
"lion" "tiger"
# Pick names of unlist'ed elements apart to define kingdom and type
n <- sub('[0-9]*$', '', names(a)) # remove sequence numbers from ends of names
# Names are of the form kingdom.type; split at .
s <- strsplit(n, '.', fixed=TRUE)
kingdom <- sapply(s, function(x) x[1])
type <- sapply(s, function(x) x[2])
# or: (note \\. is escaped . meaning not to use as wild card)
# .* = wild card: any number of characters
# kingdom <- sub('\\..*', '', n) # in xxx.yyy remove .yyy
# type <- sub('.*\\.', '', n) # in xxx.yyy remove xxx.
names(kingdom) <- names(type) <- a
w <- data.table(kingdom, type, item=a, key=c('kingdom', 'item'))
w
    kingdom       type     item
1: animal domestic cat
2: animal domestic dog
3: animal wild elephant
4: animal wild giraffe
5: animal domestic horse
6: animal wild lion
7: animal wild tiger
8: plant fruit apple
9: plant fruit banana
10: plant vegetable lettuce
11: plant fruit orange
12: plant fruit pear
13: plant vegetable potato
14: plant vegetable spinach
# Example table lookups
cat(kingdom['dog'], ':', type['dog'], '\n')
animal : domestic
kingdom[.q(dog, cat, spinach)]
     dog      cat  spinach
"animal" "animal" "plant"
type[.q(dog, cat, giraffe, spinach)]
      dog       cat   giraffe   spinach
"domestic" "domestic" "wild" "vegetable"
# But what if there is a plant named the same as an animal?
# Then look up on two keys
w[.('animal', 'lion'), type]
[1] "wild"
9 Computing Summary Statistics
Many applications can use the automatically created data.table object .SD, which stands for the data table for the current group being processed. If .SDcols is not specified, all variables are analyzed; specify a vector of variable names as .SDcols to restrict the analysis. If there are no by variables, .SD stands for the whole data table.
# Compute the number of distinct values for all variables
nd <- function(x) length(unique(x))
d[, sapply(.SD, nd)]
      bhr    basebp    basedp      pkhr       sbp        dp      dose     maxhr
69 94 441 105 142 508 7 103
pctMphr mbp dpmaxdo dobdose age gender baseEF dobEF
78 132 484 8 62 2 54 60
chestpain restwma posSE newMI newPTCA newCABG death hxofHT
2 2 2 2 2 2 2 2
hxofDM hxofCig hxofMI hxofPTCA hxofCABG any.event ecg
2 3 2 2 2 2 3
# Same but only for variables whose names contain hx and either D or M
d[, sapply(.SD, nd), .SDcols=patterns('hx', 'D|M')]
hxofDM hxofMI
2 2
# Compute means on all numeric variables
mn <- function(x) mean(x, na.rm=TRUE)
d[, lapply(.SD, mn), .SDcols=is.numeric]
        bhr   basebp   basedp     pkhr      sbp       dp     dose    maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
pctMphr mbp dpmaxdo dobdose age baseEF dobEF chestpain
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194 0.3082437
restwma posSE newMI newPTCA newCABG death hxofHT
1: 0.4605735 0.2437276 0.05017921 0.0483871 0.05913978 0.04301075 0.7043011
hxofDM hxofMI hxofPTCA hxofCABG any.event
1: 0.3691756 0.2759857 0.0734767 0.1577061 0.1594982
# Compute means on all numeric non-binary variables
nnb <- function(x) is.numeric(x) && length(unique(x)) > 2
d[, lapply(.SD, mn), .SDcols=nnb]
        bhr   basebp   basedp     pkhr      sbp       dp     dose    maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
pctMphr mbp dpmaxdo dobdose age baseEF dobEF
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194
# Print frequency tables of all categorical variables with > 2 levels
cmult <- function(x) ! is.numeric(x) && length(unique(x)) > 2
tab <- function(x) {
z <- table(x, useNA='ifany')
paste(paste0(names(z), ': ', z), collapse=', ')
}
d[, lapply(.SD, tab), .SDcols=cmult]
   hxofCig
1: heavy: 122, moderate: 138, non-smoker: 298
ecg
1: normal: 311, equivocal: 176, MI: 71
Tabulate all variables having between 3 and 10 distinct values. As a side effect while data.table is running, the summarization function tab stores all values and frequencies in a growing list Z so that kable can render a markdown table after we pad columns to the maximum length of all columns (the maximum number of distinct values).
# Using <<- makes data.table have a side effect of augmenting Z and
# Align in the global environment
tab <- function(x) {
z <- table(x, useNA='ifany')
i <- length(Z)
Z[[i+1]] <<- names(z)
Z[[i+2]] <<- as.vector(z)
Align <<- c(Align, if(is.numeric(x)) 'r' else 'l', 'r')
length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
# or i %between% c(3, 10)
Z <- list(); Align <- character(0)
w <- d[, lapply(.SD, tab), .SDcols=discr]
maxl <- max(w)
# Pad shorter vectors with blanks
Z <- lapply(Z, function(x) c(x, rep('', maxl - length(x))))
Z <- do.call('cbind', Z) # combine all into columns of a matrix
colnames(Z) <- rep(names(w), each=2)
colnames(Z)[seq(2, ncol(Z), by=2)] <- 'Freq'
knitr::kable(Z, align=Align)
| dose | Freq | dobdose | Freq | hxofCig | Freq | ecg | Freq |
|---|---|---|---|---|---|---|---|
| 10 | 2 | 5 | 7 | heavy | 122 | normal | 311 |
| 15 | 28 | 10 | 7 | moderate | 138 | equivocal | 176 |
| 20 | 47 | 15 | 55 | non-smoker | 298 | MI | 71 |
| 25 | 56 | 20 | 73 |  |  |  |  |
| 30 | 64 | 25 | 71 |  |  |  |  |
| 35 | 61 | 30 | 78 |  |  |  |  |
| 40 | 300 | 35 | 62 |  |  |  |  |
|  |  | 40 | 205 |  |  |  |  |
A better approach is to let the kables function put together a series of separate markdown tables of different sizes. By using the “updating Z in the global environment” side effect we are able to let data.table output any type of objects of non-conformable dimensions over variables (such as frequency tabulations).
tab <- function(x) {
z <- table(x, useNA='ifany')
i <- length(Z)
w <- matrix(cbind(names(z), z), ncol=2,
dimnames=list(NULL, c(vnames[i+1], 'Freq')))
Z[[i+1]] <<- knitr::kable(w, align=c(if(is.numeric(x)) 'r' else 'l', 'r'))
length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z <- list()
vnames <- names(d[, .SD, .SDcols=discr])
w <- d[, lapply(.SD, tab), .SDcols=discr]
knitr::kables(Z)
(Output: four separate frequency tables rendered side by side, one each for dose, dobdose, hxofCig, and ecg)
Use a similar side-effect approach to get separate html describe output by gender.
g <- function(x, by) {
Z[[length(Z) + 1]] <<- describe(x, descript=paste('age for', by))
by
}
Z <- list()
by <- d[, g(age, gender), by=gender]
# Make Z look like describe() output for multiple variables
class(Z) <- 'describe'
attr(Z, 'dimensions') <- c(nrow(d), nrow(by))
attr(Z, 'descript') <- 'Age by Gender'
html(Z)
2 Variables   558 Observations
age for male: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 220 | 0 | 51 | 0.999 | 67.86 | 12.91 | 45.95 | 51.00 | 62.00 | 69.00 | 75.00 | 81.00 | 86.00 |
age for female: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 338 | 0 | 58 | 0.999 | 67.01 | 13.74 | 47.00 | 50.70 | 59.25 | 68.00 | 76.00 | 83.00 | 85.00 |
# Compute a 1-valued statistic on multiple variables, by cross-classification
# of two variables. Do this on a subset. .SDcols=a:b uses variables in order
# Use keyby instead of by to order output the usual way
d[age < 70, lapply(.SD, mean), keyby=.(gender, newMI), .SDcols=pkhr:dp]
   gender newMI     pkhr      sbp       dp
1: male 0 122.0561 147.7664 17836.24
2: male 1 115.6667 140.6667 16437.67
3: female 0 122.8492 150.5084 18509.03
4: female 1 123.5714 171.5714 21506.71
# Compute multiple statistics on one variable
# Note: .N is a special variable: count of obs for current group
d[, .(Max=max(bhr), Min=min(bhr), Mean=mean(bhr), N=.N), by=.(gender, newMI)]
   gender newMI Max Min     Mean   N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
# Same result another way
g <- function(x) list(Max=max(x), Min=min(x), Mean=mean(x), N=length(x))
d[, g(bhr), by=.(gender, newMI)] # if g returned a vector instead, use as.list(g(bhr))
   gender newMI Max Min     Mean   N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
d[, as.list(quantile(bhr)), by=gender]
   gender 0% 25% 50% 75% 100%
1: male 42 63 72 83 127
2: female 45 65 75 85 210
# Compute mean bhr by quintiles of age using Hmisc::cut2
# Bad statistical practice; use scatterplot + smoother instead
d[, .(Mean=mean(bhr)), keyby=.(fifth=cut2(age, g=5))]
     fifth     Mean
1: [26,60) 78.54545
2: [60,67) 76.96907
3: [67,72) 74.26316
4: [72,78) 75.92793
5: [78,93] 70.03846
# Compute multiple statistics on multiple variables
d[, lapply(.SD, quantile), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
   gender bhr   pkhr    sbp
1: male 42 52.00 80.00
2: male 63 106.75 120.00
3: male 72 123.00 140.00
4: male 83 136.00 166.00
5: male 127 176.00 250.00
6: female 45 61.00 40.00
7: female 65 106.25 122.25
8: female 75 122.00 141.50
9: female 85 134.00 170.00
10: female 210 210.00 309.00
# Similar but put percentile number in front of statistic value
# Do only quartiles
g <- function(x) {
z <- quantile(x, (1:3)/4, na.rm=TRUE)
paste(format(names(z)), format(round(z)))
}
d[, lapply(.SD, g), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
   gender     bhr      pkhr     sbp
1: male 25% 63 25% 107 25% 120
2: male 50% 72 50% 123 50% 140
3: male 75% 83 75% 136 75% 166
4: female 25% 65 25% 106 25% 122
5: female 50% 75 50% 122 50% 142
6: female 75% 85 75% 134 75% 170
# To have more control over labeling and to have one row per sex:
g <- function(x) {
s <- sapply(x, quantile, na.rm=TRUE) # compute quantiles for all variables -> matrix
h <- as.list(s) # vectorizes first
# Cross row names (percentile names) with column (variable) names
# paste(b, a) puts variable name in front of percentile
names(h) <- outer(rownames(s), colnames(s), function(a, b) paste(b, a))
h
}
d[, g(.SD), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
   gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% pkhr 0% pkhr 25% pkhr 50%
1: male 42 63 72 83 127 52 106.75 123
2: female 45 65 75 85 210 61 106.25 122
pkhr 75% pkhr 100% sbp 0% sbp 25% sbp 50% sbp 75% sbp 100%
1: 136 176 80 120.00 140.0 166 250
2: 134 210 40 122.25 141.5 170 309
# Restrict to variables bhr - basedp in order columns created in data table
d[, g(.SD), by=gender, .SDcols=bhr : basedp]
   gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% basebp 0% basebp 25%
1: male 42 63 72 83 127 85 120.00
2: female 45 65 75 85 210 90 122.25
basebp 50% basebp 75% basebp 100% basedp 0% basedp 25% basedp 50% basedp 75%
1: 130 145 203 5220 7976.25 9483 11183.50
2: 136 150 201 5000 8562.00 10063 11891.25
basedp 100%
1: 16704
2: 27300
# Can put ! in front of a sequence of variables to do the opposite
# To add duplicated means to raw data use e.g.
# d[, Mean := mean(x), by=sex]
9.1 Summary Statistics With Marginal Summaries
The cube function in the data.table package will compute all possible marginal statistics. When there is only one by variable, the overall statistic is computed in addition to the stratified values. When a dimension is being marginalized over, the value of the by variable for that dimension is NA.
mn <- function(x) as.double(mean(x, na.rm=TRUE))
# as.double ensures consistency of numeric type across groups
Nna <- function(x) sum(! is.na(x))
cube(d, .(Meanbhr = mn(bhr), N = Nna(bhr)), by='gender', id=TRUE)
   grouping gender  Meanbhr   N
1: 0 male 73.99545 220
2: 0 female 76.13314 338
3: 1 <NA> 75.29032 558
cube(d, .(Meanbhr = mn(bhr), N = Nna(bhr)), by=.q(gender, hxofMI), id=TRUE)
   grouping gender hxofMI  Meanbhr   N
1: 0 male 1 73.22472 89
2: 0 female 0 76.30403 273
3: 0 male 0 74.51908 131
4: 0 female 1 75.41538 65
5: 1 male NA 73.99545 220
6: 1 female NA 76.13314 338
7: 2 <NA> 1 74.14935 154
8: 2 <NA> 0 75.72525 404
9: 3 <NA> NA 75.29032 558
# id=TRUE creates output variable grouping to detail level of marginalization
# It is a binary representation, e.g. if by has 3 variables and a row
# is marginalizing over the first and third variables,
# grouping=binary 101 = 5
# Use groupingsets() to control the marginalizations
# Example: marginalize only one variable at a time
groupingsets(d, .(Meanbhr = mn(bhr), N=Nna(bhr)), by=.q(gender, hxofMI),
             sets=list('gender', 'hxofMI'), id=TRUE)
   grouping gender hxofMI  Meanbhr   N
1: 1 male NA 73.99545 220
2: 1 female NA 76.13314 338
3: 2 <NA> 1 74.14935 154
4: 2 <NA> 0 75.72525 404
10 Merging Data
Consider a baseline dataset b and a longitudinal dataset L, with subject ID of id.
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
y = 1 : 12, key='id')
b
   id age
1: 1 21
2: 2 28
3: 3 32
4: 4 23
L
    id day  y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 3 1 4
5: 3 2 5
6: 3 3 6
7: 3 4 7
8: 4 1 8
9: 4 2 9
10: 5 1 10
11: 5 2 11
12: 5 3 12
# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)]   # Keeps all ids in L (left join of L with b)
    id age day  y
1: 2 28 1 1
2: 2 28 2 2
3: 2 28 3 3
4: 3 32 1 4
5: 3 32 2 5
6: 3 32 3 6
7: 3 32 4 7
8: 4 23 1 8
9: 4 23 2 9
10: 5 NA 1 10
11: 5 NA 2 11
12: 5 NA 3 12
L[b, on=.(id)]   # Keeps all ids in b (left join of b with L)
    id day  y age
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
L[b, on=.(id), nomatch=NULL]   # Keeps only ids in both b and L (inner join)
   id day y age
1: 2 1 1 28
2: 2 2 2 28
3: 2 3 3 28
4: 3 1 4 32
5: 3 2 5 32
6: 3 3 6 32
7: 3 4 7 32
8: 4 1 8 23
9: 4 2 9 23
uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]]   # Keeps ids in either b or L (full outer join)
    id day  y age
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
11: 5 1 10 NA
12: 5 2 11 NA
13: 5 3 12 NA
merge(b, L, by='id', all=TRUE)   # also full outer join; calls merge.data.table
    id age day  y
1: 1 21 NA NA
2: 2 28 1 1
3: 2 28 2 2
4: 2 28 3 3
5: 3 32 1 4
6: 3 32 2 5
7: 3 32 3 6
8: 3 32 4 7
9: 4 23 1 8
10: 4 23 2 9
11: 5 NA 1 10
12: 5 NA 2 11
13: 5 NA 3 12
For very large data tables, giving the data tables keys will speed execution, e.g.:
setkey(d, id)
setkey(d, state, city)
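As a rough sketch of why keys help (assuming a data table d containing variables id and state; the value 1203 is made up):

```r
# setkey physically sorts the table and marks the key columns, so
# subsequent keyed subsets use a fast binary search rather than a
# full vector scan
setkey(d, id)
d[.(1203)]           # binary search on the key: fast for big tables
setindex(d, state)   # a secondary index: no re-sorting, but speeds
d[state == 'TN']     # ... subsets on state via the index
```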
Join/merge can be used for data lookups:
s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL]
   st y    State
1: AL 5 Alabama
2: AK 4 Alaska
3: AZ 3 Arizona
4: CA 2 California
5: OK 1 Oklahoma
11 Reshaping Data
To reshape data from long to wide format, take the L data table above as an example.
w <- dcast(L, id ~ day, value.var='y')
w
   id  1  2  3  4
1: 2 1 2 3 NA
2: 3 4 5 6 7
3: 4 8 9 NA NA
4: 5 10 11 12 NA
# Now reverse the procedure
m <- melt(w, id.vars='id', variable.name='day', value.name='y')
m
    id day  y
1: 2 1 1
2: 3 1 4
3: 4 1 8
4: 5 1 10
5: 2 2 2
6: 3 2 5
7: 4 2 9
8: 5 2 11
9: 2 3 3
10: 3 3 6
11: 4 3 NA
12: 5 3 12
13: 2 4 NA
14: 3 4 7
15: 4 4 NA
16: 5 4 NA
setkey(m, id, day) # sorts
m[! is.na(y)]
    id day  y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 3 1 4
5: 3 2 5
6: 3 3 6
7: 3 4 7
8: 4 1 8
9: 4 2 9
10: 5 1 10
11: 5 2 11
12: 5 3 12
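dcast and melt also handle several measurement variables at once. A sketch, where z is a made-up second measurement not in the original data:

```r
L2 <- copy(L)[, z := y * 10]   # add a second measurement
w2 <- dcast(L2, id ~ day, value.var=c('y', 'z'))
# columns are now id, y_1 ... y_4, z_1 ... z_4
melt(w2, id.vars='id',
     measure.vars=patterns('^y_', '^z_'),
     value.name=.q(y, z), variable.name='day')  # back to long form, y & z in parallel
```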
12 Other Manipulations of Longitudinal Data
For the longitudinal data table L carry forward to 4 days the subject’s last observation on y if it was assessed earlier than day 4.
w <- copy(L) # fresh start with no propagation of changes back to L
# Only needed if will be using := to compute variables in-place and
# you don't want the new variables also added to L
# This is related to data.table doing things by reference instead of
# making copies. w <- L does not create new memory for w.
# Compute each day's within-subject record number and last record number
# Feed this directly into a data.table operation to save last records
# when the last is on a day < 4
u <- w[, .q(seq, maxseq) := .(1 : .N, .N), by=id][seq == maxseq & day < 4,]
# Extra observations to fill out to day 4
u <- u[, .(day = (day + 1) : 4, y = y), by=id]
u
   id day  y
1: 2 4 3
2: 4 3 9
3: 4 4 9
4: 5 4 12
w <- rbind(L, u, fill=TRUE)
setkey(w, id, day) # sort and index
w
    id day  y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 2 4 3
5: 3 1 4
6: 3 2 5
7: 3 3 6
8: 3 4 7
9: 4 1 8
10: 4 2 9
11: 4 3 9
12: 4 4 9
13: 5 1 10
14: 5 2 11
15: 5 3 12
16: 5 4 12
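An alternative sketch uses data.table's CJ (cross join) and nafill to accomplish the same carry-forward in two steps, assuming every subject has a day-1 record so there are no leading NAs:

```r
# Expand L to a complete id x day grid, then carry y forward within id
w2 <- L[CJ(id = unique(id), day = 1 : 4), on=.(id, day)]
w2[, y := nafill(y, type='locf'), by=id]   # last observation carried forward
```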
Find the first time at which y >= 3 and at which y >= 7.
day[y >= 3] is read as "the value of day when y >= 3". It is a standard subscripting operation in R for two parallel vectors day and y. Taking the minimum value of day satisfying the condition gives us the first qualifying day.
L[, .(first3 = min(day[y >= 3]),
      first7 = min(day[y >= 7])), by=id]
   id first3 first7
1: 2 3 Inf
2: 3 1 4
3: 4 1 1
4: 5 1 1
Same but instead of resulting in an infinite value if no observations for a subject meet the condition, make the result NA.
mn <- function(x) if(length(x)) min(x) else as.double(NA)
# as.double needed because day is stored as double precision
# (type contents(L) to see this) and data.table requires
# consistent storage types
L[, .(first3 = mn(day[y >= 3]),
      first7 = mn(day[y >= 7])), by=id]
   id first3 first7
1: 2 3 NA
2: 3 1 4
3: 4 1 1
4: 5 1 1
Add a new variable z and compute the first day at which z is above 0.5 for two days in a row for the subject. Note that the logic below looks for consecutive days for which records exist for a subject. To also require the days to be one day apart add the clause day == shift(day) + 1 after shift(z) > 0.5.
set.seed(1)
w <- copy(L)
w[, z := round(runif(.N), 3)]
u <- copy(w)
u
    id day  y     z
1: 2 1 1 0.266
2: 2 2 2 0.372
3: 2 3 3 0.573
4: 3 1 4 0.908
5: 3 2 5 0.202
6: 3 3 6 0.898
7: 3 4 7 0.945
8: 4 1 8 0.661
9: 4 2 9 0.629
10: 5 1 10 0.062
11: 5 2 11 0.206
12: 5 3 12 0.177
mn <- function(x)
if(! length(x) || all(is.na(x))) as.double(NA) else min(x, na.rm=TRUE)
u[, consecutive := z > 0.5 & shift(z) > 0.5, by=id][,
firstday := mn(day[consecutive]), by=id]
u
    id day  y     z consecutive firstday
1: 2 1 1 0.266 FALSE NA
2: 2 2 2 0.372 FALSE NA
3: 2 3 3 0.573 FALSE NA
4: 3 1 4 0.908 NA 4
5: 3 2 5 0.202 FALSE 4
6: 3 3 6 0.898 FALSE 4
7: 3 4 7 0.945 TRUE 4
8: 4 1 8 0.661 NA 2
9: 4 2 9 0.629 TRUE 2
10: 5 1 10 0.062 FALSE NA
11: 5 2 11 0.206 FALSE NA
12: 5 3 12 0.177 FALSE NA
In general, using methods that involve counters makes the logic clearer, easier to debug incrementally, and easier to extend to any number of consecutive times. Create a function that computes the number of consecutive TRUE values or ones, such that whenever the sequence is interrupted by FALSE or 0 the counting starts over. As before, we compute the first day at which two consecutive z values exceed 0.5.
nconsec is modified from code found here.
nconsec <- function(x) x * sequence(rle(x)$lengths)
# rle in base R: run length encoding
# Example:
x <- c(0,0,1,1,0,1,1,0,1,1,1,1)
nconsec(x)
 [1] 0 0 1 2 0 1 2 0 1 2 3 4
# To require that the time gap between measurements must be <= 2 time
# units use the following example
t <- c(1:9, 11, 14, 15)
rbind(t=t, x=x)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
t 1 2 3 4 5 6 7 8 9 11 14 15
x 0 0 1 1 0 1 1 0 1 1 1 1
nconsec(x & t <= shift(t) + 2)
 [1] 0 0 1 2 0 1 2 0 1 2 0 1
u <- copy(w)
# nconsec(z > 0.5) = number of consecutive days (counting current
# day) for which the subject had z > 0.5
u[, firstday := mn(day[nconsec(z > 0.5) == 2]), by=id]
# | | | |
# minimum | | |
# day | |
# such that |
# it's the 2nd consecutive day with z > 0.5
u
    id day  y     z firstday
1: 2 1 1 0.266 NA
2: 2 2 2 0.372 NA
3: 2 3 3 0.573 NA
4: 3 1 4 0.908 4
5: 3 2 5 0.202 4
6: 3 3 6 0.898 4
7: 3 4 7 0.945 4
8: 4 1 8 0.661 2
9: 4 2 9 0.629 2
10: 5 1 10 0.062 NA
11: 5 2 11 0.206 NA
12: 5 3 12 0.177 NA
12.1 Overlap Joins
The foverlaps function in data.table provides an amazingly fast way to do complex overlap joins. Our first example is modified from an example in the help file for foverlaps. An annotation column is added to describe what happened.
d1 <- data.table(w =.q(a, a, b, b, b),
start = c( 5, 10, 1, 25, 50),
end = c(11, 20, 4, 52, 60))
d2 <- data.table(w =.q(a, a, b),
start = c(1, 15, 1),
end = c(4, 18, 55),
name = .q(dog, cat, giraffe),
key = .q(w, start, end))
f <- foverlaps(d1, d2, type="any")
ann <- c('no d2 interval for a overlaps with a 5-11',
         'a 10-20 overlaps with a 15-18',
         'b 1-4 overlaps with b 1-55',
         'b 25-52 overlaps with b 1-55',
         'b 50-60 overlaps with b 1-55')
f[, annotation := ann]
f
   w start end    name i.start i.end                                annotation
1: a    NA  NA    <NA>       5    11 no d2 interval for a overlaps with a 5-11
2: a    15  18     cat      10    20             a 10-20 overlaps with a 15-18
3: b     1  55 giraffe       1     4                b 1-4 overlaps with b 1-55
4: b     1  55 giraffe      25    52              b 25-52 overlaps with b 1-55
5: b     1  55 giraffe      50    60              b 50-60 overlaps with b 1-55
# Don't include record for non-match
foverlaps(d1, d2, type='any', nomatch=NULL)
   w start end    name i.start i.end
1: a 15 18 cat 10 20
2: b 1 55 giraffe 1 4
3: b 1 55 giraffe 25 52
4: b 1 55 giraffe 50 60
# Require the d1 interval to be within the d2 interval
foverlaps(d1, d2, type="within")
   w start end    name i.start i.end
1: a NA NA <NA> 5 11
2: a NA NA <NA> 10 20
3: b 1 55 giraffe 1 4
4: b 1 55 giraffe 25 52
5: b NA NA <NA> 50 60
# Require the intervals to have the same starting point
foverlaps(d1, d2, type="start")
   w start end    name i.start i.end
1: a NA NA <NA> 5 11
2: a NA NA <NA> 10 20
3: b 1 55 giraffe 1 4
4: b NA NA <NA> 25 52
5: b NA NA <NA> 50 60
Now consider an example where there is an “events” dataset e with 0 or more rows per subject containing start (s) and end (e) times and a measurement x representing a daily dose of something given to the subject from s to e. The base dataset b has one record per subject with times c and d. Compute the total dose of drug received between c and d for the subject. This is done by finding all records in e for the subject such that the interval [c,d] has any overlap with the interval [s,e]. For each match compute the number of days in the interval [s,e] that are also in [c,d]. This is given by min(e,d) + 1 - max(c,s). Multiply this duration by x to get the total dose given in [c,d]. For multiple records with intervals touching [c,d] add these products.
base <- data.table(id = .q(a,b,c), low=10, hi=20)
events <- data.table(id = .q(a,b,b,b,k),
start = c( 8, 7, 12, 19, 99),
end = c( 9, 8, 14, 88, 99),
dose = c(13, 17, 19, 23, 29))
setkey(base, id, low, hi)
setkey(events, id, start, end)
w <- foverlaps(base, events,
by.x = .q(id, low, hi),
by.y = .q(id, start, end ),
type = 'any', mult='all', nomatch=NA)
w[, elapsed := pmin(end, hi) + 1 - pmax(start, low)]
w[, .(total.dose = sum(dose * elapsed, na.rm=TRUE)), by=id]
   id total.dose
1: a 0
2: b 103
3: c 0
Similar things can be done with non-equi merges. For those you can require exact subject matches but allow inexact matches on other variables. The following example is modified from here. A medication dataset holds the start and end dates for a patient being on a treatment, and a second dataset visit has one record per subject ID per doctor visit. For each visit look up the drug in effect if there was one.
medication <-
data.table(ID = c( 1, 1, 2, 3, 3),
medication = .q(a, b, a, a, b),
start = as.Date(c("2003-03-25","2006-04-27","2008-12-05",
"2004-01-03","2005-09-18")),
stop = as.Date(c("2006-04-02","2012-02-03","2011-05-03",
"2005-06-30","2010-07-12")),
key = 'ID')
medication
   ID medication      start       stop
1: 1 a 2003-03-25 2006-04-02
2: 1 b 2006-04-27 2012-02-03
3: 2 a 2008-12-05 2011-05-03
4: 3 a 2004-01-03 2005-06-30
5: 3 b 2005-09-18 2010-07-12
set.seed(123)
visit <- data.table(
ID = rep(1:3, 4),
date = sample(seq(as.Date('2003-01-01'), as.Date('2013-01-01'), 1), 12),
sbp = round(rnorm(12, 120, 15)),
key = c('ID', 'date'))
visit
    ID       date sbp
1: 1 2004-06-09 113
2: 1 2006-06-06 147
3: 1 2008-01-16 126
4: 1 2009-09-28 127
5: 2 2003-07-14 138
6: 2 2006-02-15 122
7: 2 2006-06-21 127
8: 2 2009-11-15 101
9: 3 2005-11-03 91
10: 3 2009-02-04 110
11: 3 2011-03-05 125
12: 3 2012-03-24 112
# In the inequality conditions, the variables from medication must be
# listed first
m <- medication[visit, on = .(ID, start <= date, stop > date)]
m
    ID medication      start       stop sbp
1: 1 a 2004-06-09 2004-06-09 113
2: 1 b 2006-06-06 2006-06-06 147
3: 1 b 2008-01-16 2008-01-16 126
4: 1 b 2009-09-28 2009-09-28 127
5: 2 <NA> 2003-07-14 2003-07-14 138
6: 2 <NA> 2006-02-15 2006-02-15 122
7: 2 <NA> 2006-06-21 2006-06-21 127
8: 2 a 2009-11-15 2009-11-15 101
9: 3 b 2005-11-03 2005-11-03 91
10: 3 b 2009-02-04 2009-02-04 110
11: 3 <NA> 2011-03-05 2011-03-05 125
12: 3 <NA> 2012-03-24 2012-03-24 112
# start and stop dates are replaced with actual date of visit
# drop one of them and rename the other
m[, stop := NULL]
setnames(m, 'start', 'date')
m
    ID medication       date sbp
1: 1 a 2004-06-09 113
2: 1 b 2006-06-06 147
3: 1 b 2008-01-16 126
4: 1 b 2009-09-28 127
5: 2 <NA> 2003-07-14 138
6: 2 <NA> 2006-02-15 122
7: 2 <NA> 2006-06-21 127
8: 2 a 2009-11-15 101
9: 3 b 2005-11-03 91
10: 3 b 2009-02-04 110
11: 3 <NA> 2011-03-05 125
12: 3 <NA> 2012-03-24 112
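A related tool is the rolling join. As a sketch, if we only cared about the most recent medication start on or before each visit (ignoring the stop date), roll=TRUE carries the last start date forward:

```r
med <- medication[, .(ID, date = start, medication)]
setkey(med, ID, date)
# For each visit, match the medication row with the same ID and the
# largest start date <= the visit date
med[visit, on=.(ID, date), roll=TRUE]
```

Unlike the non-equi join above, this would attribute a drug to visits after its stop date, so it fits settings where treatments are assumed to continue until replaced.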
13 Graphics
I make heavy use of ggplot2, plotly and R base graphics. plotly is used for interactive graphics, and the R plotly package provides an amazing function ggplotly to convert a static ggplot2 graphics object to an interactive plotly one. If the user goes to the trouble of adding labels for graphics entities (usually points, lines, curves, rectangles, and circles) those labels can become hover text in plotly without disturbing anything in static graphics. As shown here you can sense whether an html or pdf report is being produced, and for html all ggplot2 objects can be automatically transformed to plotly.
ggplotly puts extra text in front of hover labels, but the result of ggplotly can be run through Hmisc::ggplotlyr to remove this, as shown in the example. Many types of graphs can be created with base graphics, e.g. hist(age, nclass=50) or Ecdf(age), but using ggplot2 for even simple graphics makes it easy to handle multiple groups on one graph or to create multiple panels for strata using faceting. ggplot2 has excellent default font sizes and axis labeling that work for most sizes of plots.
Here is a prototypical ggplot2 example illustrating many of the features I most often use. Ignore the ggplot2 label attribute if not using plotly. Options are given to the Hmisc label function so that it will retrieve the variable label and units (if present) and format them for axis labels or tables. The formatting takes into account whether html output is being created and plotly is being used.
# Create a vector of formatted labels for all variables in data
# For variables without labels or units use the variable name
# as the label. If html and plotly are not in effect use R's
# regular plotmath notation to typeset labels/units
addCap()
nam <- names(d)
nv <- length(nam)
vlabs <- structure(character(nv), names=nam)
for(n in nam)
vlabs[n] <- label(d[[n]], plot=TRUE, html=ishtml, default=n)
# Define substitutes for xlab and ylab that look up our
# constructed labels.
# Could instead directly use xlab(vlabs['age'])
labx <- function(v) xlab(vlabs[[as.character(substitute(v))]])
laby <- function(v) ylab(vlabs[[as.character(substitute(v))]])
g <-
ggplot(d, aes(x=age, y=bhr, color=gender, label=paste0('dose:', dose))) +
geom_point() + geom_smooth() +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom') + # not respected by ggplotly
labs(caption='Scatterplot of age by basal heart rate stratified by sex') +
labx(age) + laby(bhr)
# or just xlab('Age in years') + ylab('Basal heart rate')
# To put the caption in a different font or size use e.g.
# theme(plot.caption=element_text(family='mono', size=7))
ggplotlyr(g, remove='.*): ') # removes paste0("dose:", dose): # dose is in hover text for each point
For large datasets the Hmisc package has a function ggfreqScatter that makes it easy to see overlapping points by color coding the frequency of points in each small bin. That way scatterplots scale to very large datasets. Here is an example:
html=TRUE was needed because otherwise axis labels are formatted using R’s plotmath and plotly doesn’t like that.
addCap()
set.seed(1)
x <- round(rnorm(2000), 1)
y <- 2 * (x > 1.5) + round(rnorm(2000), 1)
z <- sample(c('a', 'b'), 2000, replace=TRUE)
label(x) <- 'X Variable' # could use xlab() &
label(y) <- 'Y Variable' # ylab() in ggfreqScatter()
g <- ggfreqScatter(x, y, by=z, html=ishtml)
# If variables were inside a data table use
# g <- d[, ggfreqScatter(x, y, by=z, html=ishtml)]
g
Now convert the graphic to plotly if html is in effect, otherwise stay with ggplot2 output.
addCap()
ggp(g)
When you hover the mouse over a point, its frequency pops up.
Many functions in the Hmisc and rms packages produce plotly graphics directly. One of the most distinctive is dotchartpl.
14 Analysis
14.1 Big Picture
For analysis the sky is the limit, but statistical principles should guide every step. Some of the general principles are
- If there is to be a pivotal analysis there should be a statistical analysis plan (SAP) for this analysis that does not allow for many “statistician degrees of freedom.” The plan should be completed before doing any analysis that might inform analysis choices in a way that would bias the results (e.g., bias the estimate of treatment effect or bias standard errors of effects in a model).
- All analyses should be completely reproducible. Explicitly state random number seeds if any random processes (bootstrap, simulation, Bayesian posterior sampling) are involved.
- Exploratory analysis can take place after any needed SAP is completed.
- Stay close to the raw data
- Continuous or ordinal variables should never be dichotomized even for purely descriptive exploratory analysis. For example, computing proportions of patients with disease stratified by quintiles of weight will be both inefficient and misleading.
- Descriptive and inferential statistics should respect the study design. For parallel-group studies, it is not appropriate to compute change from baseline.
- Question whether unadjusted estimates should be presented. If females in the study are older and age is a risk factor for the outcome, what is the meaning of female - male differences unadjusted for age?
- For observational group comparisons, make sure that experts are consulted about which variables are needed to capture selection processes (e.g., confounding by indication) before data acquisition. If data are already collected and do not contain the variables that capture reasons for decisions such as treatment selection, you may do well to find a different project.
- If the study is a parallel-group randomized clinical trial (RCT), presenting descriptive statistics stratified by treatment (“Table 1”) is not helpful, and it is more informative to describe the overall distribution of subjects. Even more helpful is to show how all baseline variables relate to the outcome variable.
- An RCT is designed to estimate relative treatment effectiveness, and since it does not incorporate random sampling from the population, it cannot provide outcome estimates for a single treatment arm that reference the population. Hence uncertainty intervals for per-treatment outcomes are not meaningful, and uncertainty intervals should be presented only for treatment differences. This is facilitated by “half confidence intervals” described below.
- Avoid the tendency to interchange the roles of independent and dependent variables by presenting a “Table 2” that stratifies by the outcome. Stratifying (conditioning) on the outcome places it in the role of a baseline variable. Instead, show relationships of baseline variables to outcomes as mentioned in the previous point.
- Nonparametric smoothers and estimating in overlapping moving windows are excellent tools for relating individual continuous variables to an outcome.
- Models are often the best descriptive tools because they can account for multiple variables simultaneously. For example, instead of computing proportions of missing values of a variable Y stratified by age groups and sex, use a binary logistic regression model to relate smooth nonlinear age and sex to the probability Y is missing.
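As a concrete sketch of the last point, missingness can be modeled directly. The following uses simulated data and base R's glm with a natural spline rather than the rms functions used elsewhere in this report; all variable names here are hypothetical.

```r
# Sketch: model P(Y is missing) as a smooth function of age and sex,
# instead of tabulating missingness by arbitrary age groups
library(splines)
set.seed(1)
d <- data.frame(age = runif(500, 20, 80),
                sex = sample(c('female', 'male'), 500, replace=TRUE))
# Simulate missingness whose probability rises smoothly with age
d$y <- ifelse(runif(500) < plogis(-2 + 0.02 * d$age), NA, rnorm(500))
f <- glm(is.na(y) ~ ns(age, 3) + sex, family=binomial, data=d)
# Estimated P(Y missing) for a 70-year-old female
predict(f, data.frame(age=70, sex='female'), type='response')
```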
14.2 Replacement for Table 1
Analyses should shed light on the unknown and not dwell on the known. In a randomized trial, the distributions of baseline variables are expected to be the same across treatments, and become ever closer as \(N\) grows. When apparent imbalances are found, acting on them leads to inappropriate decisions and ignores the fact that apparently counterbalancing factors are not hard to find. What is unknown and new is how the subject characteristics (and treatment) relate to the outcomes under study. While displaying this trend with a nonparametric smoother, one can simultaneously display the marginal distribution of the characteristic using an extended box plot, spike histogram, or rug plot. Here is an example using the Hmisc summaryRc function. Extended box plots show the mean, median, and quantiles that cover 0.25, 0.5, 0.75, and 0.9 of the distribution.
addCap()
par(mfrow=c(2,2))
summaryRc(hospdead ~ age + crea + meanbp + wblc,
ylim=c(.1, .6), data=support, bpplot='top')
Let’s take a closer look at extended box plots. Here is an example in which the distribution of continuous variables is shown, stratified by disease group.
addCap()
bpplotM(age + crea + meanbp + wblc ~ dzgroup,
data=support, cex.strip=0.4, cex.means=0.3, cex.n=0.45)
This is better done with interactive plots so that one can for example hover over a corner of a box plot and see which quantile that corner represents.
addCap()
s <- summaryM(age + crea + meanbp + wblc ~ dzgroup,
data=support)
options(grType='plotly')
plot(s)
14.3 Descriptively Relating One Variable to Another
To understand the relationship between a continuous variable X and an outcome or another variable Y we may estimate the mean, median, and other quantities as a smooth function of X. There are many ways to do this, including
- making a scatter plot if Y is continuous or almost continuous
- stratifying by fixed or variable intervals of X, e.g., summarizing Y by quintiles of X. This is arbitrary, inefficient, and misleading and should never be done.
- using a nonparametric smoother such as loess
- parametrically estimating the mean Y as a function of X using an ordinary linear least squares (OLS) model with a regression spline in X so as to not assume linearity
- likewise but with a logistic regression model if Y is binary
- semiparametrically estimating quantiles of Y as a function of X using quantile regression and a regression spline for X
- semiparametrically estimating the mean, quantiles, and exceedance probabilities of Y as a function of X using an ordinal regression model and a spline in X
- nonparametrically using overlapping moving windows of X that advance by a small amount each time. For each window compute the estimate of the property of Y using ordinary sample estimators (means, quantiles, Kaplan-Meier estimates, etc.). This approach has the fewest assumptions and is very general in the sense that all types of Y are accommodated. The moving estimates need to be smoothed; the R supsmu function is well suited for this.
The estimated trend curves depend on the window width and amount of smoothing, but this problem is tiny in comparison with the huge effect of changing how a continuous predictor is binned when the usual non-overlapping strata are created. The idea is to assume smooth relationships and get close to the data.
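To make the moving-window idea concrete, here is a minimal self-contained base-R sketch on simulated data. The window size and increment mirror the movStats defaults, but this is only an illustration, not the movStats implementation.

```r
# Moving mean of y in overlapping windows of x, then smoothed with supsmu
set.seed(2)
x <- sort(runif(400))
y <- sin(2 * pi * x) + rnorm(400, sd=0.5)
n   <- length(x)
eps <- 15                                   # observations on each side of target
centers <- seq(eps + 1, n - eps, by=max(floor(n / 200), 1))
est <- sapply(centers, function(i) mean(y[(i - eps) : (i + eps)]))
sm  <- supsmu(x[centers], est)              # smooth the noisy moving estimates
plot(x, y, col='gray')
lines(sm, lwd=2)
```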
In the following several of the above methods are illustrated to study how serum creatinine of critically ill patients relates to age. Start with a scatterplot that has no problems with ties in the data.
addCap()
with(support, ggfreqScatter(age, crea))
Now consider moving estimates, least squares (OLS), ordinal regression (ORM), and quantile regression (QR) estimates, nonparametric loess estimates, and a flexible adaptive survival model. Moving estimates computed on overlapping x-variable windows, moving averages being the oldest example, have the advantage of great flexibility. As long as one has an estimator (mean, median, Kaplan-Meier estimate, etc.) that can be applied to a relatively homogeneous (with respect to x) sample, moving statistics can estimate smooth trends over x. Unless the windows are wide or the sample size is very large so that one can afford to use narrow x windows, the moving statistics will be noisy and need to be further smoothed. The smaller the windows, the larger the amount of smoothing will be needed. To control bias it is generally better to have smaller windows and more after-estimation smoothing.
The function movStats on Github provides two methods for creating moving overlapping windows from x. The default used here creates varying-width intervals in the data space but fixed-width in terms of sample size. It includes by default 15 observations to the left of the target point and 15 to the right, and moves up \(\max(\frac{n}{200}, 1)\) observations for each evaluation of the statistics. These may be overridden by specifying eps and xinc. If the user does not provide a statistical estimation function stat, the mean and all three quartiles are estimated for each window. movStats makes heavy use of the data.table, rms, and other packages. For ordinal regression estimates of the mean and quantiles the log-log link is used in the example below. Moving estimates are shown with and without supsmu-smoothing them.
addCap()
getRs('movStats.r', put='source')
u <- movStats(crea ~ age,
loess=TRUE, ols=TRUE, qreg=TRUE,
orm=TRUE, family='loglog', msmooth='both',
melt=TRUE, data=support, pr='margin')
| N | Mean | Min | Max | xinc |
|---|---|---|---|---|
| 997 | 31 | 25 | 31 | 4 |
# pr='margin' causes window information to be put in margin
ggplot(u, aes(x=age, y=crea, col=Type)) + geom_line() +
facet_wrap(~ Statistic) +
xlab('Age') + ylab('Serum Creatinine')
Recommended practice for relating a continuous variable to another continuous variable, especially for replacing parts of Table 1 or Table 2, is to use smoothed moving statistics or (1) a spline OLS model to estimate the mean and (2) a spline quantile regression model to estimate quantiles. Here is a best-practice example that shows a preferred subset of the estimates from the last plot. melt=TRUE is omitted so we can draw a ribbon to depict the outer quartiles.
addCap()
u <- movStats(crea ~ age, bass=9, data=support)
ggplot(u, aes(x=age, y=`Moving Median`)) + geom_line() +
geom_ribbon(aes(ymin=`Moving Q1`, ymax=`Moving Q3`), alpha=0.2) +
geom_line(aes(x=age, y=`Moving Mean`, col=I('blue'))) +
xlab('Age') + ylab('Serum Creatinine') +
labs(caption='Black line: median\nBlue line: mean\nBand: Q1 & Q3')
Let’s describe how white blood count relates to the probability of hospital death, using a binary logistic regression model and moving proportions. The cube root transformation in regression fits is used because of the extreme skewness of WBC. Use 6 knots at default locations on \(\sqrt[3]{\mathrm{WBC}}\). The \(\sqrt[3]{\mathrm{WBC}}\) transformation affects moving statistics only in that mean x-values for plotting are cubes of mean \(\sqrt[3]{\mathrm{WBC}}\) instead of means on the original WBC scale.
addCap()
u <- movStats(hospdead ~ wblc, k=6, eps=20, bass=3,
trans = function(x) x ^ (1/3),
itrans = function(x) x ^ 3,
loess=TRUE, lrm=TRUE, msmooth='both',
melt=TRUE, pr='margin', data=support)
| N | Mean | Min | Max | xinc |
|---|---|---|---|---|
| 976 | 40.8 | 30 | 41 | 4 |
ggplot(u, aes(x=wblc, y=hospdead, col=Type)) + geom_line() +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom')
The flexibility of the moving statistic method is demonstrated by estimating how age relates to probabilities of death within 1y and within 2y using Kaplan-Meier estimates in overlapping moving windows. Assumptions other than smoothness (e.g., proportional hazards) are avoided in this approach. Here is an example that also uses a flexible parametric method, hazard regression, implemented in the R polspline package, that adaptively finds knots (points of slope change) in the covariate and in time, and products of piecewise linear terms so as to allow for non-proportional hazards. We use far less penalization than is the default for the hare function for demonstration purposes. For this dataset the default settings of penalty and maxdim result in straight lines.
addCap()
u <- movStats(Surv(d.time / 365.25, death) ~ age, times=1:2,
eps=30, bass=9,
hare=TRUE, penalty=0.5, maxdim=30,
melt=TRUE, data=support)
ggplot(u, aes(x=age, y=incidence, col=Statistic)) + geom_line() +
facet_wrap(~ Type) +
ylab(label(u$incidence)) +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom')
movStats can also compute stratified non-smoothed estimates when x is discrete. After computing 1- and 2-year Kaplan-Meier incidence probability estimates, order disease groups by ascending 1-year mortality before plotting.
addCap()
u <- movStats(Surv(d.time / 365.25, death) ~ dzgroup, times=1:2,
discrete=TRUE,
melt=TRUE, data=support)
m1 <- u[Statistic == '1-year', .(dzgroup, incidence)]
i <- m1[, order(incidence)]
u[, dzgroup := factor(dzgroup, levels=m1[i, dzgroup])]
ggplot(u, aes(x=incidence, y=dzgroup, col=Statistic)) + geom_point() +
xlab(label(u$incidence)) + ylab('') +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom')
14.4 One Continuous and One Categorical Predictor
It is possible to descriptively estimate trends against more than one independent variable when the effective sample size is sufficient. Trends can be estimated nonparametrically through stratification (when the third variable is categorical) or with flexible regression models allowing the two predictors to interact. In graphical displays it is useful to keep in mind sample size limitations in certain regions of the space defined by the two predictors, by superimposing spike histograms on trend curves.
Repeat the last example but stratified by disease class. The window is widened a bit because of the reduced sample size upon stratification. Default smoothing is used for hazard regression.
# The Coma stratum has only n=60 so is not compatible with eps=75
# Use varyeps options
addCap()
u <- movStats(Surv(d.time / 365.25, death) ~ age + dzclass, times=1:2,
eps=30,
msmooth='both', bass=8, hare=TRUE,
melt=TRUE, data=support, pr='margin')
| | N | Mean | Min | Max | xinc |
|---|---|---|---|---|---|
| ARF/MOSF | 477 | 59.9 | 40 | 61 | 2 |
| COPD/CHF/Cirrhosis | 314 | 59.4 | 40 | 61 | 1 |
| Coma | 60 | 50.0 | 40 | 60 | 1 |
| Cancer | 149 | 57.5 | 40 | 61 | 1 |
ggplot(u, aes(x=age, y=incidence, col=dzclass)) + geom_line() +
facet_grid(Type ~ Statistic) +
ylab(label(u$incidence)) +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom')
Consider another example with a continuous dependent variable. Use the NHANES dataset that was created for analyzing glycohemoglobin (HbA\(_{\mathrm{1c}}\)) for diabetes screening. Stratify by race/ethnicity.
addCap()
getHdata(nhgh)
u <- movStats(gh ~ age + re,
melt=TRUE, data=nhgh, pr='margin')
| | N | Mean | Min | Max | xinc |
|---|---|---|---|---|---|
| Mexican American | 1366 | 31.0 | 25 | 31 | 6 |
| Other Hispanic | 706 | 30.9 | 25 | 31 | 3 |
| Non-Hispanic White | 3117 | 31.0 | 25 | 31 | 15 |
| Non-Hispanic Black | 1217 | 31.0 | 25 | 31 | 6 |
| Other Race Including Multi-Racial | 389 | 30.9 | 25 | 31 | 1 |
ggplot(u, aes(x=age, y=gh, col=re)) + geom_line() +
facet_wrap( ~ Statistic) +
ylab(label(nhgh$gh)) +
guides(color=guide_legend(title='', nrow=2)) +
theme(legend.position='bottom')
Mimic these results using flexible regression with interaction. Start by estimating the mean. Add spike histograms to estimated trend curves. Spike heights are proportional to the sample size in age/race-ethnicity groups after binning age into 100 bins. Direct plotly plotting is used. The user can click on elements of the legend (including the histograms) to turn their display off and on.
addCap()
require(rms)
options(prType='html') # needed to use special formatting (can use prType='latex')
dd <- datadist(nhgh); options(datadist='dd')
f <- ols(gh ~ rcs(age, 5) * re, data=nhgh)
# fontsize will be available for print(anova()) in rms 6.3-1
makecolmarg(anova(f), dec.ms=2, dec.ss=2, fontsize=0.6)
Analysis of Variance for gh
| | d.f. | Partial SS | MS | F | P |
|---|---|---|---|---|---|
| age (Factor+Higher Order Factors) | 20 | 878.06 | 43.90 | 55.20 | <0.0001 |
| All Interactions | 16 | 42.55 | 2.66 | 3.34 | <0.0001 |
| Nonlinear (Factor+Higher Order Factors) | 15 | 61.26 | 4.08 | 5.13 | <0.0001 |
| re (Factor+Higher Order Factors) | 20 | 169.42 | 8.47 | 10.65 | <0.0001 |
| All Interactions | 16 | 42.55 | 2.66 | 3.34 | <0.0001 |
| age × re (Factor+Higher Order Factors) | 16 | 42.55 | 2.66 | 3.34 | <0.0001 |
| Nonlinear | 12 | 14.62 | 1.22 | 1.53 | 0.1051 |
| Nonlinear Interaction : f(A,B) vs. AB | 12 | 14.62 | 1.22 | 1.53 | 0.1051 |
| TOTAL NONLINEAR | 15 | 61.26 | 4.08 | 5.13 | <0.0001 |
| TOTAL NONLINEAR + INTERACTION | 19 | 101.38 | 5.34 | 6.71 | <0.0001 |
| REGRESSION | 24 | 937.86 | 39.08 | 49.13 | <0.0001 |
| ERROR | 6770 | 5384.94 | 0.80 | ||
# Normal printing: anova(f) or anova(f, dec.ms=2, dec.ss=2)
hso <- list(frac=function(f) 0.1 * f / max(f),
side=1, nint=100)
# Plot with plotly directly
plotp(Predict(f, age, re), rdata=nhgh, histSpike.opts=hso)
Now use quantile regression to estimate quartiles of glycohemoglobin as a function of age and race/ethnicity.
addCap()
f1 <- Rq(gh ~ rcs(age, 5) * re, tau=0.25, data=nhgh)
f2 <- Rq(gh ~ rcs(age, 5) * re, tau=0.5, data=nhgh)
f3 <- Rq(gh ~ rcs(age, 5) * re, tau=0.75, data=nhgh)
p <- rbind(Q1 = Predict(f1, age, re, conf.int=FALSE),
Median = Predict(f2, age, re, conf.int=FALSE),
Q3 = Predict(f3, age, re, conf.int=FALSE))
ggplot(p, histSpike.opts=hso)
14.5 Another Replacement for Table 1
We can create a matrix of plots that respect continuous baseline variables while staying close to the data through the use of overlapping moving windows. In the following example we compute moving 1y and 2y mortality for selected continuous baseline variables in support and stack them together. Flexible HARE hazard regression estimates are also included.
reptools includes a function varType to determine the continuous/discrete nature of each variable, and other functions that make it easy to extract the list of either continuous variables (conVars) or discrete variables (disVars). varType also has a third classification: non-numeric variables that have too many (by default > 20) distinct values to be considered discrete.
# Exclude outcome variables from consideration (see earlier)
types <- varType(support, exclude=outcomes)
print(types, quote=FALSE)
$continuous
[1] age edu scoma meanbp wblc hrt resp temp pafi
[10] alb bili crea sod ph glucose bun urine
$discrete
[1] sex dzgroup dzclass num.co income race adlp adls
Let’s use only the first 9 continuous variables. In addition to showing all the estimated relationships with the outcome, put covariate distributions in a collapsed note. Note the bimodality of some of the measurements, and the true zero blood pressures for patients having cardiac arrest.
addCap()
V <- types$continuous[1:9]
U <- list()
for(v in V) {
x <- support[[v]]
u <- movStats(Surv(d.time / 365.25, death) ~ x, times=1:2,
eps=30, hare=TRUE, penalty=0.25, maxdim=10,
msmooth='smoothed', bass=8,
melt=TRUE, data=support)
U[[label(x, default=v)]] <- u # stuffs u in an element of list U
# & names the element w/ var label/name
}
w <- rbindlist(U, idcol='vname') # stack all the data tables
ggplot(w, aes(x, y=incidence, col=Statistic, linetype=Type)) + geom_line() +
facet_wrap(~ vname, scales='free_x') +
ylab(label(u$incidence)) + xlab('') +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom',
strip.text = element_text(size=8))
makecnote(`Covariate Distributions` ~ plot(describe(support[, ..V])))
If we were not showing main graphs in wide format (using a Quarto callout) we could have put the marginal distributions in the right margin using the following, which shrinks the plotly output.
require(plotly) # for %>%
pl <- plot(describe(support[, ..V])) %>%
layout(autosize=FALSE, width=350, height=325)
makecolmarg(~ pl)
Likewise we can produce a graph summarizing how categorical baseline variables relate to the study outcome variable.
addCap()
V <- types$discrete # or disVars(support, exclude=...)
U <- list()
for(v in V) {
x <- support[[v]]
u <- movStats(Surv(d.time / 365.25, death) ~ x, times=1:2,
discrete=TRUE, melt=TRUE, data=support)
U[[label(x, default=v)]] <- u
}
w <- rbindlist(U, idcol='vname') # stack the tables
ggplot(w, aes(x=incidence, y=x, col=Statistic)) + geom_point() +
facet_wrap(~ vname, scales='free_y') +
xlab(label(u$incidence)) + ylab('') +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom')
Alternatively we can put each variable in a separate tab:
gg <- function(data)
ggplot(data, aes(x=incidence, y=x, col=Statistic)) + geom_point() +
xlab('Mortality') + ylab('') +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom')
g <- lapply(U, gg) # one ggplot per element (a data table) in U
maketabs(g, baselabel='catkmtabs')
14.6 Confidence Bands for Differences
Studies almost never randomly sample from a population, hence inference to the population for a single treatment’s outcome should seldom be attempted. The uncertainty intervals and bands that should be presented are ones having inferential meaning and are based on treatment differences. One can easily construct a graph that shows differences and confidence intervals for them, but it is useful to be able to show the individual group estimates along with CIs for the differences. Fortunately, Maarten Boers had the idea of a null bar or null zone. When a confidence interval for a difference is symmetric, the confidence interval includes 0.0 if and only if the midpoint of the two outcome estimates \(\pm \frac{1}{4} \times w\) touches the individual group estimates, where \(w\) is the width of the confidence interval. Null zone/half-width CIs can be put to especially good use in avoiding clutter when displaying Kaplan-Meier plots, and can be graphed using the rms package survplot (static plot) and survplotp (plotly interactive graphic) functions. The latter has the additional advantage of providing continuous data on number of subjects still at risk by hovering over the survival curve for one group. Here is an example using support. Estimate survival differences between patients who were or were not able to be interviewed for determining their baseline activities of daily living.
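The one-quarter-width geometry is easy to verify numerically. This small simulation (not from the report itself) confirms that a symmetric CI for the difference contains zero exactly when the band centered at the midpoint, extended by a quarter of the CI width, reaches the group estimates:

```r
# If the CI for a - b is (a - b) +/- w/2, it contains 0 exactly when the
# band (a + b)/2 +/- w/4 touches the individual group estimates
set.seed(3)
for(i in 1:1000) {
  a <- rnorm(1); b <- rnorm(1)
  w <- runif(1, 0, 4)                        # full CI width for a - b
  includes0   <- abs(a - b) <= w / 2
  bandTouches <- abs(a - (a + b) / 2) <= w / 4
  stopifnot(includes0 == bandTouches)
}
```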
Cumulative incidence is recommended over cumulative survival probabilities, principally because many journals will force you to scale the \(y\)-axis for survival probability as \([0,1]\) even in a very low-risk sample, whereas journals do not have silly scaling conventions for cumulative incidence.
addCap() # add to table of figures
require(rms)
s <- support[, .(d.time = d.time / 365.25,
death,
interviewed = ifelse(is.na(adlp), 'not interviewed',
'interviewed'))]
units(s$d.time) <- 'year'
# Compute nonparametric Kaplan-Meier estimates (uses survival::survfit)
f <- npsurv(Surv(d.time, death) ~ interviewed, data=s)
survplotp(f, fun=function(y) 1. - y)
Hovering over the curves reveals the continuous number at risk. Non-recommended individual survival curves can be turned on by clicking in the legend, and the null zone bands can be turned off. Note strong evidence for a difference in mortality early but not as much late.
To produce an additional .pdf graphic in a separate file, with the number at risk shown below the graph, run the following. Add include=FALSE in the chunk header if you want to show no trace of this chunk in the report.
pdf('survplot.pdf')
survplot(f, fun=function(y) 1. - y, conf='diffbands',
ylab='Cumulative Mortality', levels.only=TRUE,
n.risk=TRUE, time.inc=1, label.curves=list(keys='lines'))
dev.off()
The adverse event chart in Figure 29 is another example using half-width confidence intervals.
14.7 Formatting
I take advantage of special formatting for model fit objects from the rms package by using html or latex methods and putting results='asis' in the chunk header to preserve the formatting.
require(rms)
options(prType='html') # needed to use special formatting (can use prType='latex')
dd <- datadist(support); options(datadist='dd') # rms needs for summaries, plotting
cr <- function(x) x ^ (1/3)
f <- lrm(hospdead ~ rcs(meanbp, 5) + rcs(age, 5) + rcs(cr(crea), 4), data=support)
f
Logistic Regression Model
lrm(formula = hospdead ~ rcs(meanbp, 5) + rcs(age, 5) + rcs(cr(crea),
4), data = support)
Frequencies of Missing Values Due to Each Variable
hospdead meanbp age crea
0 0 0 3
| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 997 | LR χ2 184.06 | R2 0.249 | C 0.758 |
| 0 744 | d.f. 11 | R211,997 0.159 | Dxy 0.515 |
| 1 253 | Pr(>χ2) <0.0001 | R211,566.4 0.263 | γ 0.516 |
| max |∂log L/∂β| 8×10-12 | Brier 0.153 | τa 0.195 |
| | β | S.E. | Wald Z | Pr(>|Z|) |
|---|---|---|---|---|
| Intercept | 11.2198 | 2.2071 | 5.08 | <0.0001 |
| meanbp | -0.1020 | 0.0212 | -4.81 | <0.0001 |
| meanbp’ | 0.1929 | 0.1766 | 1.09 | 0.2749 |
| meanbp’’ | -0.1286 | 0.6715 | -0.19 | 0.8481 |
| meanbp’’’ | -0.2329 | 0.6487 | -0.36 | 0.7196 |
| age | -0.0152 | 0.0209 | -0.73 | 0.4678 |
| age’ | 0.0003 | 0.0802 | 0.00 | 0.9968 |
| age’’ | 0.1889 | 0.5300 | 0.36 | 0.7216 |
| age’’’ | -0.5770 | 1.2262 | -0.47 | 0.6380 |
| crea | -6.3192 | 1.8831 | -3.36 | 0.0008 |
| crea’ | 83.8882 | 17.6618 | 4.75 | <0.0001 |
| crea’’ | -196.2020 | 40.5695 | -4.84 | <0.0001 |
makecnote(anova(f)) # in collapsible note
anova(f)
Wald Statistics for hospdead
| | χ2 | d.f. | P |
|---|---|---|---|
| meanbp | 78.98 | 4 | <0.0001 |
| Nonlinear | 63.70 | 3 | <0.0001 |
| age | 2.86 | 4 | 0.5817 |
| Nonlinear | 2.78 | 3 | 0.4275 |
| crea | 46.03 | 3 | <0.0001 |
| Nonlinear | 24.52 | 2 | <0.0001 |
| TOTAL NONLINEAR | 90.53 | 8 | <0.0001 |
| TOTAL | 131.57 | 11 | <0.0001 |
Write a function to compute several rms package model summaries and put them in tabs. raw in a formula makes the generated R chunk include output in raw format.
rmsdisplay <- function(f)
maketabs(
` ` ~ ` `,
Model ~ f,
Specs ~ specs(f, long=TRUE) + raw,
Equation ~ latex(f),
ANOVA ~ anova(f) + plot(anova(f)),
ORs ~ plot(summary(f), log=TRUE, declim=2),
`Partial Effects` ~ ggplot(Predict(f)),
Nomogram ~ plot(nomogram(f, fun=plogis, funlabel='P(death)')),
baselabel='lrmtabs', cap=c('', '', '', '', '-', '-', '-', '-'))
rmsdisplay(f)
Logistic Regression Model
lrm(formula = hospdead ~ rcs(meanbp, 5) + rcs(age, 5) + rcs(cr(crea),
4), data = support)
Frequencies of Missing Values Due to Each Variable
hospdead meanbp age crea
0 0 0 3
| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 997 | LR χ2 184.06 | R2 0.249 | C 0.758 |
| 0 744 | d.f. 11 | R211,997 0.159 | Dxy 0.515 |
| 1 253 | Pr(>χ2) <0.0001 | R211,566.4 0.263 | γ 0.516 |
| max |∂log L/∂β| 8×10-12 | Brier 0.153 | τa 0.195 |
| | β | S.E. | Wald Z | Pr(>|Z|) |
|---|---|---|---|---|
| Intercept | 11.2198 | 2.2071 | 5.08 | <0.0001 |
| meanbp | -0.1020 | 0.0212 | -4.81 | <0.0001 |
| meanbp’ | 0.1929 | 0.1766 | 1.09 | 0.2749 |
| meanbp’’ | -0.1286 | 0.6715 | -0.19 | 0.8481 |
| meanbp’’’ | -0.2329 | 0.6487 | -0.36 | 0.7196 |
| age | -0.0152 | 0.0209 | -0.73 | 0.4678 |
| age’ | 0.0003 | 0.0802 | 0.00 | 0.9968 |
| age’’ | 0.1889 | 0.5300 | 0.36 | 0.7216 |
| age’’’ | -0.5770 | 1.2262 | -0.47 | 0.6380 |
| crea | -6.3192 | 1.8831 | -3.36 | 0.0008 |
| crea’ | 83.8882 | 17.6618 | 4.75 | <0.0001 |
| crea’’ | -196.2020 | 40.5695 | -4.84 | <0.0001 |
lrm(formula = hospdead ~ rcs(meanbp, 5) + rcs(age, 5) + rcs(cr(crea),
4), data = support)
Label Transformation Assumption
meanbp Mean Arterial Blood Pressure Day 3 rcspline
age Age rcspline
crea Serum creatinine Day 3 cr(crea) rcspline
Parameters d.f.
meanbp 47 65.725 78 106 128.05 4
age 33.762 53.801 64.896 72.841 86 4
crea 0.84342 1 1.1447 1.7758 3
meanbp age crea
Low:effect 64.75 51.81099 0.8999023
Adjust to 78.00 64.89648 1.1999512
High:effect 107.00 74.49821 1.8999023
Low:prediction 33.00 22.19788 0.4959985
High:prediction 147.00 91.93815 9.0039844
Low 0.00 18.04199 0.2999878
High 180.00 101.84796 11.7988281
\(\mathrm{crea}\) is pre--transformed as \(\mathrm{cr(crea)}\).
Wald Statistics for hospdead
| | χ2 | d.f. | P |
|---|---|---|---|
| meanbp | 78.98 | 4 | <0.0001 |
| Nonlinear | 63.70 | 3 | <0.0001 |
| age | 2.86 | 4 | 0.5817 |
| Nonlinear | 2.78 | 3 | 0.4275 |
| crea | 46.03 | 3 | <0.0001 |
| Nonlinear | 24.52 | 2 | <0.0001 |
| TOTAL NONLINEAR | 90.53 | 8 | <0.0001 |
| TOTAL | 131.57 | 11 | <0.0001 |
15 Caching
The workhorse behind Rmarkdown and Quarto (besides Pandoc) is knitr, which processes the code chunks and properly mingles code with tabular and graphical output. knitr has a built-in caching mechanism so that code is not needlessly executed when its inputs have not changed. This easy-to-use process has two disadvantages: the dependencies are not transparent, and the stored cache files may be quite large. I like to take control of caching. To that end, the runifChanged function was written. Here is an example of its use. First a function with no arguments must be composed. This is the (usually slow) function that will be conditionally run if any of a group of listed objects has changed since the last time it was run. When the function needs to be run, it produces an object that is stored in binary form in a user-specified file (the default file name is the name of the current R code chunk with .rds appended).
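The underlying pattern can be sketched in miniature with base R only. Here slowFit, hashOf, and the cache file name are hypothetical stand-ins; runifChanged is more general, hashing several objects at once including the source code of listed functions.

```r
# Hash-then-cache: rerun the slow step only when its inputs change
hashOf <- function(x) {               # fingerprint any R object
  tf <- tempfile()
  saveRDS(x, tf, compress = FALSE)    # uncompressed => deterministic bytes
  unname(tools::md5sum(tf))
}
slowFit <- function(dat) { Sys.sleep(0.2); lm(y ~ x, data = dat) }  # stand-in
cachedFit <- function(dat, file = 'fit-cache.rds') {
  h <- hashOf(dat)
  if (file.exists(file)) {
    cache <- readRDS(file)
    if (identical(cache$hash, h)) return(cache$value)  # inputs unchanged
  }
  value <- slowFit(dat)
  saveRDS(list(hash = h, value = value), file)
  value
}
```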
# Read the source code for the hashCheck and runifChanged functions from
# https://github.com/harrelfe/rscripts/blob/master/hashCheck.r
getRs('hashCheck.r', put='source')
g <- function() {
# Fit a logistic regression model and bootstrap it 500 times, saving
# the matrix of bootstrapped coefficients
f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE, data=dat)
bootcov(f, B=500)
}
set.seed(3)
n <- 2000
dat <- data.table(x1=runif(n), x2=runif(n),
y=sample(0:1, n, replace=TRUE))
# runifChanged will write runifch.rds if needed (chunk name.rds)
# Will run if dat or source code for lrm or bootcov change
b <- runifChanged(g, dat, lrm, bootcov)
dim(b$boot.Coef)
[1] 500 3
head(b$boot.Coef)
      Intercept          x1          x2
[1,] 0.02007292 -0.30079958 0.32416398
[2,] 0.06150624 -0.35741054 0.25522669
[3,] 0.25225861 -0.40094541 0.09290729
[4,] 0.13766665 -0.48661991 0.19684403
[5,] -0.22018456 0.02132711 0.33973578
[6,] 0.18217417 -0.36140896 -0.04873320
16 Parallel Computing
The runParallel function makes it easy to use available multiprocessor cores to speed up parallel computations, especially simulations. By default it uses one fewer than the number of available cores. runParallel makes the parallel package easier to use and recombines results over per-core batches. The user writes a function that does the work on one core, and the same function is run on all cores. This function has set arguments and must return a named list. A base random number seed is given, and the seed is set to this plus i for core number i. The total number of repetitions is given, and the most balanced possible number of repetitions is run on each core so that the per-core counts sum to the total desired number of iterations. runifChanged is again used, to avoid running the simulations if no inputs have changed.
# Load runParallel function from github
getRs('runParallel.r', put='source')
# Function to do simulations on one core
run1 <- function(reps, showprogress, core) {
cof <- matrix(NA, nrow=reps, ncol=3,
dimnames=list(NULL, .q(a, b1, b2)))
for(i in 1 : reps) {
y <- sample(0:1, n, replace=TRUE)
f <- lrm(y ~ X)
cof[i, ] <- coef(f)
}
list(coef=cof)
}
# Debug one core run, with only 3 iterations
n <- 300
seed <- 3
set.seed(seed)
X <- cbind(x1=runif(n), x2=runif(n)) # condition on covariates
run1(3)$coef
a b1 b2
[1,] -0.5455330 0.9572896 0.2215974
[2,] -0.2459193 0.3081405 0.1284809
[3,] -0.1391344 -0.2668562 0.1792319
nsim <- 5000
g <- function() runParallel(run1, reps=nsim, seed=seed)
Coefs <- runifChanged(g, X, run1, nsim, seed)
dim(Coefs)
[1] 5000    3
apply(Coefs, 2, mean)
            a            b1            b2
 0.0020121803 -0.0007277216 -0.0003258610
17 Simulation
Some of the best ways to validate an analysis are
- If using any model/feature selection methods, use the bootstrap to check whether the selection process is volatile, e.g., whether your sample size is too small to support making hard-and-fast selections of predictors/features
- Use Monte Carlo simulation to check if the correct model or correct predictors are usually selected
- Simulate a large dataset under a known model and known parameter values and make sure the estimation process you use can recover the true parameter values
- Simulate the statistical performance of a method under a variety of conditions
Unlike static papers in the literature, simulation can study the performance of a method under conditions that mimic your situation.
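As a minimal example of the third strategy, simulate a large dataset from a linear model with known coefficients and check that the fitting procedure recovers them. The coefficient values 1, 2, and -3 here are arbitrary choices for illustration.

```r
# Parameter-recovery check: fit should recover the true coefficients
set.seed(1)
n  <- 50000
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)   # known truth: 1, 2, -3
coef(lm(y ~ x1 + x2))                  # estimates should be close to 1, 2, -3
```

With n this large the estimates land within a couple of hundredths of the true values; systematic departures would signal a bug in the data generation or the estimation code.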
When simulating performance of various quantities under various conditions, creating a large number of separate variables makes the code long and tedious. It is better to use data frames/tables or arrays to hold everything together. Data frames and arrays also lead to efficient graphics code for summarization.
17.1 Data Table Approach
The expand.grid function is useful for generating all combinations of simulation conditions. Suppose we wanted to simulate statistical properties of the maximum absolute value of the sample correlation coefficient from a matrix of all pairwise correlations from truly uncorrelated variables. We do this while varying the sample size n, the number of variables p, and the type of correlation (Pearson’s or Spearman’s, denoted by r and rho). With expand.grid we don’t need a lot of nested for loops. Run 500 simulations for each condition.
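To see what expand.grid produces, here is a toy call with only two values per condition; each row is one combination, so a single loop over rows replaces nested loops.

```r
# All combinations of two sample sizes and two variable counts
expand.grid(n=c(10, 20), p=c(2, 5))
#    n p
# 1 10 2
# 2 20 2
# 3 10 5
# 4 20 5
```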
nsim <- 500
R <- expand.grid(n=c(10, 20, 50, 100),
p=c(2, 5, 10, 20),
sim=1 : nsim)
setDT(R)
set.seed(1)
for(i in 1 : nrow(R)) { # took 4s
w <- R[i, ]
n <- w$n
p <- w$p
X <- matrix(rnorm(n * p), ncol=p)
cors <- cor(X)
maxr <- max(abs(cors[row(cors) < col(cors)])) # use upper triangle
cors <- cor(X, method='spearman')
maxrho <- max(abs(cors[row(cors) < col(cors)]))
set(R, i, 'maxr', maxr) # set is in data.table & is very fast
set(R, i, 'maxrho', maxrho) # set will create the variable if needed
# If not using data.table use this slower approach:
# R[i, 'maxr'] <- maxr etc.
}
The simulations could have been cached or parallelized as discussed above.
Compute the mean (over simulations) maximum correlation (over variables) and plot the results.
addCap()
w <- R[, .(maxr = mean(maxr), maxrho=mean(maxrho)), by=.(n, p)]
# Make data table taller and thinner to put r, rho as different observations
u <- melt(w, id.vars=c('n', 'p'), variable.name='type', value.name='r')
u[, type := substring(type, 4)] # remove "max"
ps <- c(2, 5, 10, 20)
u[, p := factor(p, ps, paste0('p:', ps))]
g <- ggplot(u, aes(x=n, y=r, col=type)) + geom_jitter(height=0, width=2) +
ylim(0, 1) +
facet_wrap(~ p) +
guides(color=guide_legend(title='')) +
ylab('Mean Maximum Correlation Coefficient')
plotly::ggplotly(g)
17.2 Array Approach
For large problems, storing results in R arrays is more efficient and doesn’t require duplication of values of n and p over simulations. Once the array is created it can be converted into a data table for graphing.
nsim <- 500
ns <- c(10, 20, 50, 100)
ps <- c(2, 5, 10, 20)
R <- array(NA, dim=c(nsim, length(ns), length(ps), 2),
dimnames=list(NULL,
n = as.character(ns),
p = as.character(ps),
type = c('r', 'rho')))
dim(R)
[1] 500   4   4   2
dimnames(R)
[[1]]
NULL

$n
[1] "10"  "20"  "50"  "100"

$p
[1] "2"  "5"  "10" "20"

$type
[1] "r"   "rho"

set.seed(1)
Note the elegance below in how current simulation results are inserted into the simulation results object R, making use of dimension names as subscripts, except for subscript i for the simulation number, which is an ordinary sequential integer subscript. Were the simulated values vectors instead of a scalar (maxr below), we would have used a statement such as R[i, as.character(n), as.character(p), type, ] <- calculated.vector.
for(i in 1 : nsim) { # took 1s
for(n in ns) {
for(p in ps) {
X <- matrix(rnorm(n * p), ncol=p)
for(type in c('r', 'rho')) {
cors <- cor(X,
method=switch(type, r = 'pearson', rho = 'spearman'))
maxr <- max(abs(cors[row(cors) < col(cors)]))
R[i, as.character(n), as.character(p), type] <- maxr
}
}
}
}
There are many other ways to specify cor(X, method=...). Here are several other codings for method that will yield equivalent results.
fcase(type == 'r', 'pearson', type == 'rho', 'spearman')
fcase(type == 'r', 'pearson', default='spearman')
c(r='pearson', rho='spearman')[type]
.q(r=pearson, rho=spearman)[type]
if(type == 'r') 'pearson' else 'spearman'
ifelse(type == 'r', 'pearson', 'spearman')
# Compute mean (over simulations) maximum correlation for each condition
m <- apply(R, 2:4, mean) # preserve dimensions 2,3,4 summarize over 1
# Convert the 3-dimensional array to a tall and thin data table
# A generalization of row() and col() (which are for 2-dimensional
# matrices) comes in handy: slice.index
dn <- dimnames(m)
u <- data.table(r = as.vector(m),
n = as.numeric(dn[[1]])[as.vector(slice.index(m, 1))],
p = as.numeric(dn[[2]])[as.vector(slice.index(m, 2))],
type = dn[[3]][as.vector(slice.index(m, 3))])
# If doing this a lot you may want to write a dimension expander function
slice <- function(a, i) {
dn <- all.is.numeric(dimnames(a)[[i]], 'vector') # all.is.numeric in Hmisc
dn[as.vector(slice.index(a, i))]
}
u <- data.table(r = as.vector(m),
n = slice(m, 1),
p = slice(m, 2),
type = slice(m, 3))
# Plot u using same ggplot code as above
The result is the same as in Figure 63.
18 Figures
| Figure | Short Caption |
|---|---|
| Figure 1 | NAs/var |
| Figure 2 | NAs/obs |
| Figure 3 | Mean # var NA when other var NA |
| Figure 4 | NAs/var vs mean # other variables NA |
| Figure 5 | Clustering of missingness |
| Figure 6 | NA combinations |
| Figure 7 | Consort diagram produced with consort_plot |
| Figure 8 | Consort diagram using component functions |
| Figure 9 | Consort diagram produced by mermaid |
| Figure 10 | Consort diagram produced with mermaid with individual exclusions linked to the overall exclusions node, and with a tooltip to show more detail |
| Figure 11 | Flowchart of sequential exclusion of observations due to missing values |
| Figure 12 | Continuous |
| Figure 13 | Discrete |
| Figure 14 | Categorical |
| Figure 15 | Continuous |
| Figure 16 | Categorical |
| Figure 17 | Continuous |
| Figure 18 | Categorical Variable Plot |
| Figure 19 | Continuous Variable Plot |
| Figure 20 | Spike histograms of heart rate stratified by ECG category |
| Figure 21 | n=1 |
| Figure 22 | n [2,5) |
| Figure 23 | n [5,8) |
| Figure 24 | n [8,10] |
| ?@fig-multevent | Multi-event chart for 4 patients |
| Figure 27 | State transition proportions |
| Figure 28 | State occupancy proportions by time and male/female |
| Figure 29 | Proportion of adverse events by Treatment |
| Figure 30 | Event chart |
| Figure 31 | Variable clustering in support |
| Figure 32 | plotly translation of a ggplot2 graph making use of variable labels from a data table that are translated to use within-string html font changes |
| Figure 33 | ggfreqScatter example, making use of color coded frequencies of points that will work for any size dataset and any number of coincident points |
| Figure 34 | plotly version of Figure 33 |
| Figure 35 | summaryRc function graphics containing a matrix of graphs of nonparametrically estimated relationships between continuous baseline variables and the probability that a patient in the ICU will die in the hospital |
| Figure 36 | bpplotM extended box plot examples with stratification by disease group |
| Figure 37 | summaryM plotly graphic with interactive extended box plots |
| Figure 38 | ggfreqScatter plot showing all raw data for two continuous variables with only slight binning |
| Figure 39 | movStats moving estimates of mean and quantiles of crea as a function of age using small overlapping windows, with and without smoothing of the moving estimates |
| Figure 40 | Moving mean and quantile estimates of effect of age with interquartile bands |
| Figure 41 | Moving estimates of the relationship between white blood count and hospital mortality, using a \(\sqrt[3]{}\) transformation to make the WBC distribution more symmetric |
| Figure 42 | Moving one minus Kaplan-Meier and HARE estimates of the relationship between age and the probability of dying by 1y and by 2y |
| Figure 43 | Ordinary incidence estimates stratified by disease group, with groups ordered by 1-year mortality estimates |
| Figure 44 | Moving Kaplan-Meier (smoothed and unsmoothed) and HARE estimates of the age effect on time to death, stratified by disease class |
| Figure 45 | Moving estimates of effect of age on glycohemoglobin stratified by race/ethnicity |
| Figure 46 | Predicted mean glycohemoglobin as a function of age and race/ethnicity, with age modeled as a restricted cubic spline with 5 default knots, and allowing the shape of the age effect to be arbitrarily different for the race/ethnicity groups |
| Figure 47 | Smooth age effects on three quartiles of hba1c |
| Figure 48 | Moving Kaplan-Meier and HARE estimates of a series of continuous covariate effects stacked into one ggplot2 graphic |
| Figure 49 | Kaplan-Meier estimates of 1y and 2y incidence stratified separately by a series of discrete predictors |
| Figure 50 | sex |
| Figure 51 | dzgroup |
| Figure 52 | dzclass |
| Figure 53 | number of comorbidities |
| Figure 54 | income |
| Figure 55 | race |
| Figure 56 | ADL Patient Day 3 |
| Figure 57 | ADL Surrogate Day 3 |
| Figure 58 | Interactive survival curves with half-width confidence bands |
| Figure 59 | ANOVA |
| Figure 60 | ORs |
| Figure 61 | Partial Effects |
| Figure 62 | Nomogram |
| Figure 63 | Simulation results for estimating the expected value of the maximum absolute correlation coefficient for a varying number \(p\) of variables and varying sample size when all true correlations are zero |
19 Other Resources
- Software development resources for data scientists by Isabella Velásquez
20 Computing Environment
The following output is created by the command markupSpecs$html$session(), where markupSpecs is defined in the Hmisc package.
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] polspline_1.1.19 kableExtra_1.3.4 knitr_1.37        rms_6.3-1
 [5] SparseM_1.81     cluster_2.1.3    consort_1.0.1     plotly_4.10.0
 [9] Hmisc_4.7-1      ggplot2_3.3.5    Formula_1.2-4     survival_3.3-1
[13] lattice_0.20-45  data.table_1.14.2

To cite R in publications use:
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use:
Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-1, https://hbiostat.org/R/Hmisc/.
To cite the rms package in publications use:
Harrell Jr FE (2022). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.
To cite the data.table package in publications use:
Dowle M, Srinivasan A (2021). data.table: Extension of 'data.frame'. R package version 1.14.2, https://CRAN.R-project.org/package=data.table.
To cite the ggplot2 package in publications use:
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
To cite the survival package in publications use:
Therneau T (2022). A Package for Survival Analysis in R. R package version 3.3-1, https://CRAN.R-project.org/package=survival.